The Good Tech Companies - MongoDB vs ScyllaDB: Architecture Comparison

Episode Date: January 26, 2026

This story was originally published on HackerNoon at: https://hackernoon.com/mongodb-vs-scylladb-architecture-comparison. A deep architectural comparison of MongoDB and ScyllaDB, revealing why their designs lead to very different performance and scalability. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #mongodb-vs-scylladb, #scylladb-shard-per-core-design, #multi-primary-nosql-databases, #high-throughput-nosql-db, #distributed-db-performance, #mongodb-sharded-cluster, #scalable-mongodb-replica-set, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. MongoDB and ScyllaDB solve similar NoSQL problems using fundamentally different architectures. MongoDB relies on replica sets and sharded clusters that increase operational complexity as workloads scale. ScyllaDB uses a multi-primary, shard-per-core design that delivers predictable low latency, high throughput, and simpler horizontal scaling, especially for performance-critical workloads.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. MongoDB vs. ScyllaDB: Architecture Comparison, by ScyllaDB. BenchANT compares the MongoDB and ScyllaDB architectures, with a focus on what the differences mean for performance and scalability. When choosing a NoSQL database, the options can be overwhelming. One of the most popular choices is MongoDB, known for its ease of use. But the highly performance-oriented ScyllaDB is one of the rising challengers. This BenchANT report takes a closer technical look at both databases, comparing their architectures from an independent technical angle. Both MongoDB and ScyllaDB promise a highly available,
Starting point is 00:00:42 performant, and scalable architecture. But the way they achieve these objectives is more different than you might think at first glance. For instance, an experience report demonstrates how ScyllaDB can easily be operated on AWS EC2 spot instances thanks to its distributed architecture, while MongoDB's distributed architecture would make this a very challenging task. To highlight these differences, we provide an in-depth discussion of the internal storage architectures and the distributed architectures enabling high availability and horizontal scalability. Note: we also just released a benchmark quantifying the impact of these differences. Read the ScyllaDB versus MongoDB benchmark summary,
Starting point is 00:01:20 download this comparison report. A performance viewpoint on the storage architecture of MongoDB versus ScyllaDB. Both databases are implemented in C++ and recommend the use of the XFS file system. Moreover, MongoDB and ScyllaDB build upon the write-ahead logging concept: the commit log in ScyllaDB terminology and the oplog in MongoDB terminology. With write-ahead logging, all operations are written to a log before the operation is executed. The write-ahead log serves as a source for replicating the data to other nodes, and it is used to restore data in case of failures, because it is possible to "replay" the operations to restore the data.
Starting point is 00:02:06 MongoDB's default storage engine, WiredTiger, uses a B+ tree index for data storage and retrieval. B+ tree indexes are balanced tree data structures that store data in sorted order, making it easy to perform range-based queries. MongoDB supports multiple indexes on a collection, including compound indexes, text indexes, and geospatial indexes. Indexing of array elements and nested fields is also possible, allowing for efficient queries on complex data structures. In addition, the enterprise version of MongoDB supports an in-memory storage engine for low-latency workloads.
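As a quick illustration of those index types, the following PyMongo snippet creates a compound, a text, a geospatial, and a nested-field index. The connection string, database, collection, and field names are hypothetical.

```python
from pymongo import MongoClient, ASCENDING, DESCENDING, TEXT, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
events = client.demo.events                        # hypothetical collection

# Compound index: supports sorted and range queries on (userId, ts).
events.create_index([("userId", ASCENDING), ("ts", DESCENDING)])

# Text index for keyword search over a description field.
events.create_index([("description", TEXT)])

# Geospatial index for location-based queries.
events.create_index([("location", GEOSPHERE)])

# Index on a nested field inside embedded documents.
events.create_index([("device.firmware.version", ASCENDING)])
```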
Starting point is 00:02:54 by applying a disk-persistent commit log together with memory-based memtables that are flushed to disk over time. SkylaDB supports primary, secondary, and composite indexes, both local per node and global per cluster. The primary index consists of a hashing ring where the hashed key and the corresponding partition are stored. And within the partition, Skyla-Db finds the row in a sorted data structure, SS table, which is a variant of the LSM tree. The secondary index is maintained in an index table. When a secondary index is queried, SkyladyB first retrieves the partition key, which ICE associated with the secondary key, and afterward the data value for the secondary
Starting point is 00:03:34 key on the right partition. These different storage architectures result in a different usage of the available hardware to handle the workload. MongoDB does not pin internal threads to available CPU cores but applies an unbound approach to distributed threads to cores. With modern Numa-based CPU architectures, this can cause a performance degradation, especially for large servers because threads can dynamically be assigned to cores on different sockets with different memory nodes. In contrast, Skyladyby follows a shard per core approach that allows it to pin the responsible threads to specific cores and avoid
Starting point is 00:04:09 switching between different cores and memory spaces. In consequence, the shard key needs to be selected carefully tonure and equal data distribution across the shards and to prevent hot shards. Moreover, Skyladyby comes with an I.O. scheduler that provides built-in priority classes for latency-sensitive and insensitive queries, as well as the coordinated I.O scheduling across the shards on one node to maximize disk performance. Finally, SkyladyB's install scripts come with a performance auto-tuning step by applying the optimal database configuration based on the available resources. In consequence, a clear performance advantage of Skylid B can be expected. SkyladyB allows the user to control whether data should reside in the DB cache or bypass it for
Starting point is 00:04:53 rarely accessed partitions. Skyladyby allows the client to reach the node and CPU core, shard, that owns the data. This provides lower latency, consistent performance and perfect load balancing. SkyladyB also provides workload prioritization, which provides the user different SLAs for different workloads to guarantee lower latency for certain crucial workloads. The MongoDB distributed architecture, two operation modes for high availability and scalability. The MongoDB database architecture offers two cluster modes that are described in the following sections. A replica set cluster targets high availability, while a sharded cluster targets horizontal scalability and high availability. Replica set cluster. High availability with limited scalability. The MongoDB architect
Starting point is 00:05:41 enables high availability by the concept of replica sets. MongoDB replica sets follow the concept of primary secondary nodes, where only the primary handles the right operations. The secondaries hold a copy of the data and can be enabled to handle read operations only. A common replica said deployment consists of two secondaries, but additional secondaries can be added to increase availability or to scale read heavy workloads. MongoDB supports up to 50 secondaries within one replica set, secondaries will be elected as primary in case of a failure at the former primary. Regarding geo-distribution, MongoDB supports geo-distributed deployments for replica sets to ensure high availability in case of data center failures. In this context, secondary instances can be
Starting point is 00:06:28 distributed across multiple data centers, as shown in the following figure. In addition, secondaries with limited resources or network constraints can be configured with a priority to control their electability as primary in case of a failure. Sharded cluster, horizontal scalability and high availability with operational complexity. MongoDB supports horizontal scaling by sharding data across multiple primary instances to cope with right intensive workloads and growing data sizes. In a sharded cluster, each replica set consisting of one primary and multiple secondaries represents a shard. Since MongoDB 4, four secondaries can also be used to handle red requests by using the hedged read option. To enable sharding, additional MongoDB node types are required.
Starting point is 00:07:14 Query routers, Mongo's, and config servers. A Mongo's instance acts as a query router, providing an interface between client applications and the sharded cluster. In consequence, clients never communicate directly with the shards, but always via query router. Query routers are stateless and lightweight components that can be operated on dedicated resources or together with the client applications. It is very very much. It is very recommended to deploy multiple query routers to ensure the accessibility of the cluster because the query routers are the direct interface for the client drivers. There is no limit to the number of query routers, but as they communicate frequently with the config servers, it should be noted
Starting point is 00:07:53 that too many query routers can overload the config servers. Config servers store the metadata of a sharded cluster, including state and organization for all data and components. The metadata includes the list of chunks on every shard and the ranges that define the chunks. Config servers need to be deployed as a replica set itself to ensure high availability. Data sharding in MongoDB is done at the collection level, and a collection can be sharded based on a shard key. MongoDB uses a shard key to determine which documents belong on which shard. Common shard key choices include the underscore id field and the field with a high cardinality, such as a timestamp or user ID.
Starting point is 00:08:34 MongoDB supports three sharding strategies, range-based, hash-based, and zone-based. Ranged sharding partitions documents across shards according to the shard key value. This keeps documents with shard key values close to one another and works well for range-based queries, e. G, on time series data. Hashed sharding guarantees a uniform distribution of rights across shards, which favors right workloads. Zone sharding allows developers to define custom sharding rules, for instance, to ensure that the most relevant data reside on shards that are geographically closest to the application servers. Also, sharded clusters can be deployed in a geo-distributed setup to overcome data center failures, as depicted in the following figure. The Skyladybee architecture,
Starting point is 00:09:19 multi-primary for high availability and horizontal scalability. Unlike MongoDB, Skyla-Db does not follow the classical RDBMs architectures with one primary node and multiple secondary nodes, but uses a decentralized structure, where all data is systematically distributed and replicated across multiple nodes forming a cluster. This architecture is commonly referred to as multi-primary architecture. A cluster is a collection of interconnected nodes organized into a virtual ring architecture, across which data is distributed. The ring is divided into v nodes, which represent a range of tokens assigned to a physical node, and are replicated across physical nodes according to the replication factor set for the key space. All nodes are considered equal, in a multi-primary sense. Without adafined leader,
Starting point is 00:10:06 the cluster has no single point of failure. Nodes can be individual on-premises servers or virtual servers, public cloud instances, composed of a subset of hardware on a larger physical server. On each node, data is further partitioned into shards. Shards operate as mostly independently operating units, known as a shared nothing design. This greatly reduces contention and the need for expensive processing locks. All nodes communicate with each other via the gossip protocol. This protocol decided decides in which partition which data is written and searches for the data records in the right partition using the indexes. When it comes to scaling, Skylidiby's architecture is made for easy horizontal's harding across multiple servers and regions. Sharding in Skyladyby is done at the table level,
Starting point is 00:10:52 and a table can be sharded based on a partition key. The partition key can be a single column or a composite of multiple columns. Skylidib also supports range-based sharding, where rows are distributed across shards based on the partition key value range, as well as hash-based sharding for equally distributing data and to avoid hot spots. Additionally, SkyladyB allows for data to be replicated across multiple data centers for higher availability and lower latencies. In this multi-data center or multi-region setup, the data between data centers is asynchronously replicated. On the client side, applications may or may not be aware of the multi-data center deployment, and it is up to the application developer to decide on the awareness to fallback data centers. This can be configured via the
Starting point is 00:11:39 read and write consistency options that define if queries are executed against a single data center or across all data centers. Load balancing in a multi-data center setup depends on the available settings within the specific programming language driver. A comparative scalability viewpoint on the distributed architectures of MongoD band Skyladyby. When it comes to scalability, the significantly different distribution approaches of both Skyladyby and MongoDB need to be considered, especially for self-managed clusters running on-premises or on IAS. MongoDB's architecture easily allows scaling read-heavy workloads by increasing the number of secondaries in a replica set. Yet, for scaling workloads with a notable right proportion,
Starting point is 00:12:20 the replica sets need to be transformed into a sharded replica set and this comes with several challenges. First, two additional MongoDB services are required. In-quiry routers, mongoes, and a replica set of config servers to ensure high availability. Consequently, considerably more resources are required to enable sharding in the first place. Moreover, the operational complexity clearly increases. For instance, a sharded cluster with three shards requires a replica set of three Mongo's instances, a replica set of three config servers and three shards, each shard consisting of one primary and at least two secondaries. The second challenge is the repartitioning of data in the sharded cluster. Here, MongoDB applies a constantly running
Starting point is 00:13:05 background task that autonomously triggers the redistribution of data across the shards. The repartitioning does not take place as soon as a new shard is added to the cluster, but when certain internal thresholds are reached. Consequently, increasing the number of shards will immediately scale the cluster but may have a delayed scaling effect. Until MongoDB version 5,0, MongoDB engineers themselves recommend to not shard, butrather to scale vertically with bigger machines if possible. Scaling a Skyladyby cluster is comparably easy and transparent for the user thanks to Skyladyby's multi-primary architecture. Here, each node is equal, and NO additional services are needed to scale the cluster to hundreds of nodes. Moreover, data repartitioning is triggered as soon as a new
Starting point is 00:13:51 node is added to the cluster. In this context, Skyletibb offers clear advantages over MongoDB. First, Thanks to the consistent hashing approach, data does not need to bear a partitioned across the full cluster, only across a subset of nodes. Second, the partitioning starts with adding the new node, which eases the timing of the scaling action. This is important, since repartitioning will put some additional load on the cluster and should be avoided at peak workload phases. The main scalability differences are summarized in the following table, conclusion and outlook. When you compare two distributed noSQL databases, you always discover some parameters. parallels, but also numerous considerable differences. This is also the case here with SkylaDB versus MongoDB.
Starting point is 00:14:37 Both databases address similar use cases and have a similar product and community strategy. But when it comes to the technical side, you can see the different approaches and focus. Both databases are built for enabling high availability through a distributed architecture. But when it comes to the target workloads, MongoDB enables easily getting started with single node or replica said deployments that fit well for small and medium workloads, while addressing large workloads and datasets becomes a challenge due to the technical architecture. Skyladybee clearly addresses performance-critical workloads that demand for easy and high scalability, high throughput, low and stable latency, and everything INA multi-data center deployment. This is also shown by
Starting point is 00:15:19 data-intensive use cases of companies such Discord, Numberly or T-R-A-C-T-I-N that migrated from MongoDB to SkylaDB to successfully solve performance problems. And to provide further insights into their respective performance capabilities, we provide a transparent and reproducible performance comparison in a separate benchmark report that investigates the performance, scalability, and costs for MongoDB Atlas and SkylaDB Cloud. Additional SkyloDib versus MongoDB comparison details. See the complete Bench Ant MongoDB versus SkyloDB comparison for an extended version of this technical comparison, including details comparing, data model, query language, use cases and customer examples, data consistency options, first-hand operational experience. Thank you for
Starting point is 00:16:06 listening to this Hackernoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn and publish.
