The Good Tech Companies - Cache vs. Database: Comparing Memcached and ScyllaDB

Episode Date: January 29, 2026

This story was originally published on HackerNoon at: https://hackernoon.com/cache-vs-database-comparing-memcached-and-scylladb. A deep benchmark-driven comparison of ScyllaDB and Memcached, revealing when a database can rival a cache in performance. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #scylladb-vs-memcached, #memcached-extstore-performance, #database-vs-cache-architecture, #scylladb-shard-per-core-cache, #low-latency-data-infra, #flash-backed-cache-performance, #high-throughput-cache, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. ScyllaDB and Memcached take different paths to low-latency performance, yet benchmarks show they can converge under real workloads. Memcached excels at lightweight, pipelined key-value access, while ScyllaDB trades higher per-item overhead for persistence, richer data models, and predictable scaling. The right choice depends on workload shape, data size, and resilience needs.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Cache versus Database: Comparing Memcached and ScyllaDB, by ScyllaDB. An in-depth look at database and cache internals and the trade-offs in each. ScyllaDB would like to publicly acknowledge dormando, Memcached's maintainer, and Danny Kopping for their contributions to this project, as well as thank them for their support and patience. Engineers behind ScyllaDB, the database for predictable performance at scale, joined forces with Memcached maintainer dormando to compare both technologies head-to-head,
Starting point is 00:00:34 in a collaborative, vendor-neutral way. The results reveal that both Memcached and ScyllaDB maximized disk and network bandwidth while being stressed under similar conditions, sustaining similar performance overall. While ScyllaDB required data modeling changes to fully saturate the network throughput, Memcached required additional I/O threads to saturate disk I/O. Although ScyllaDB showed better latencies than Memcached when requests were pipelined to disk, Memcached's latencies were better for individual requests. This document explains our motivation for these tests, provides a summary of the tested scenarios and results, then presents recommendations for anyone who might be deciding between ScyllaDB and
Starting point is 00:01:14 Memcached. Along the way, we analyze the architectural differences behind these two solutions and discuss the trade-offs involved in each. There's also a detailed guidebook for this project, with a more extensive look at the tests and results and links to the specific configurations you can use to perform the tests yourself. Bonus: dormando and I recently discussed this project at P99 CONF, a highly technical conference on performance and low-latency engineering; watch it on demand. Why have we done this? First and foremost, ScyllaDB invested lots of time and engineering resources optimizing our database to deliver predictable low latencies for real-time data-intensive applications. ScyllaDB's shard-per-core, shared-nothing architecture,
Starting point is 00:01:58 userspace I/O scheduler, and internal cache implementation, fully bypassing the Linux page cache, are some notable examples of such optimizations. Second, performance converges over time. In-memory caches have long been regarded as one of the fastest infrastructure components around, yet it has been a few years now since caching solutions started to look into the realm of flash disks. These initiatives pose an interesting question: if an in-memory cache can rely on flash storage, then why can't a persistent database also work as a cache? Third, we previously discussed seven reasons not to put a cache in front of your database and recently explored how specific teams have successfully replaced their caches with ScyllaDB.
Starting point is 00:02:42 Fourth, at last year's P99 CONF, Danny Kopping gave us an enlightening talk, Cache Me If You Can, where he explained how Memcached's Extstore helped Grafana Labs scale their cache footprint 42x while driving down costs. And finally, despite the valid criticism that performance benchmarks receive, they still play an important role in driving innovation, and they are a useful resource for engineers seeking in-house optimization opportunities. Now, on to the comparison. Setup. Instances: tests were carried out using the following AWS instance types. Loader: c7i.16xlarge, 64 vCPUs, 128 GB RAM. Memcached: i4i.4xlarge, 16 vCPUs, 128 GB RAM, 3.75 TB NVMe. ScyllaDB: i4i.4xlarge, 16 vCPUs, 128 GB RAM, 3.75 TB
Starting point is 00:03:44 NVMe. All instances can deliver up to 25 Gbps of network bandwidth. Keep in mind that, especially during tests maxing out the advertised network capacity, we noticed throttling shrinking the bandwidth down to the instances' baseline capacity. Optimizations and settings: to overcome potential bottlenecks, the following optimizations and settings were applied. AWS side: all instances used a cluster placement strategy. Following the AWS docs, "this strategy enables workloads to achieve the low-latency network performance necessary for tightly coupled node-to-node communication that is typical of high-performance computing (HPC) applications."
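A cluster placement group is created ahead of time and then referenced when the instances are launched. A minimal sketch with the AWS CLI, where the group name, AMI and instance count are placeholders rather than the project's exact launch parameters:

# Create the placement group, then launch benchmark instances into it
aws ec2 create-placement-group --group-name benchmark-pg --strategy cluster
aws ec2 run-instances --image-id ami-0123456789abcdef0 --count 1 \
  --instance-type i4i.4xlarge --placement GroupName=benchmark-pg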
Starting point is 00:04:17 Memcached: version 1.6.25, compiled with Extstore enabled. Except where denoted, it was run with 14 threads pinned to specific CPUs. The remaining two vCPUs were assigned to CPU0's core and its HT sibling to handle network IRQs, as specified by the sq_split mode in Seastar's perftune.py. CAS operations were disabled to save space on per-item overhead. The full command-line arguments were: taskset -c 1-7,9-15 /usr/local/memcached/bin/memcached -v -m 114100 -c 4096 --lock-memory --threads 14 -u scylla. ScyllaDB: default settings as configured by ScyllaDB Enterprise
Starting point is 00:05:15 2024.1.2 (AMI ID ami-018335b47bdf9a) on an i4i.4xlarge, with the same CPU pinning settings as described above for Memcached. Stressors: for the Memcached loaders, we used mcshredder, part of Memcached's official testing suite. The applicable stressing profiles are in the fee-mendes/shredders GitHub repository. For ScyllaDB, we used cassandra-stress as shipped with ScyllaDB and specified workloads comparable to the ones used for Memcached. Tests and results: the following is a summary of the tests we conducted and their results. If you want a more detailed description and analysis, go to the extended write-up of this project.
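The exact stressing profiles live in the repositories mentioned above. As a rough sketch of the shape of a comparable key-value run with cassandra-stress, under our own assumptions (the node address, thread count and population size below are placeholders; the real runs used the project's profiles rather than these defaults):

# Warm up with ~1KB values, then read back the same key range for 30 minutes
cassandra-stress write n=61000000 cl=ONE \
  -col size='FIXED(1000)' n='FIXED(1)' \
  -mode native cql3 -pop seq=1..61000000 \
  -rate threads=512 -node 10.0.0.1

cassandra-stress read duration=30m cl=ONE \
  -col size='FIXED(1000)' n='FIXED(1)' \
  -mode native cql3 -pop 'dist=UNIFORM(1..61000000)' \
  -rate threads=512 -node 10.0.0.1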
Starting point is 00:06:01 RAM caching efficiency. The more items you can fit into RAM, the better your chance of getting cache hits. More cache hits result in significantly faster access than going to disk, and ultimately that improves latency. This project began by measuring how many items we could store in each data store. Throughout our tests, the key was between 4 to 12 bytes (key0 through keyN) for Memcached, and 12 bytes for ScyllaDB. The value was fixed to 1,000 bytes. Memcached: Memcached stored roughly 101M items until eviction started. It's memory efficient: out of Memcached's 114G of assigned memory, this is approximately 101G worth of values, without considering the key size and other flags. ScyllaDB: ScyllaDB stored between
Starting point is 00:06:48 60 to 61M items before eviction started. This is no surprise, given that its protocol requires more data to be stored as part of a write, such as the write timestamp since epoch, row liveness information, etc. ScyllaDB also persists data to disk as you go, which means that Bloom filters (and optionally indexes) need to be stored in memory for subsequent disk lookups. Takeaways: Memcached stored approximately 65% more in-memory items than ScyllaDB. ScyllaDB rows have higher per-item overhead to support its wide-column orientation. In ScyllaDB, Bloom filters, index caches, and other components are also kept in memory to support efficient disk lookups, contributing to yet another layer of overhead.
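For a fill-until-eviction measurement like this, Memcached's stats command exposes the relevant counters. One way to check them during a run, assuming netcat and a reachable server (the host name is a placeholder):

# Item count, memory used by values, and evictions so far
printf 'stats\r\nquit\r\n' | nc memcached-host 11211 | egrep 'curr_items|^STAT bytes |evictions'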
Starting point is 00:07:31 Read-only in-memory workload. The ideal, though unrealistic, workload for a cache is one where all the data fits in RAM, so that reads don't require disk accesses and no evictions or misses occur. Both ScyllaDB and Memcached employ LRU (least recently used) logic for freeing up memory: when the system runs under pressure, items get evicted from the LRU's tail, and these are typically the least active items. Taking evictions and cache misses out of the picture helps measure and set a performance baseline for both data stores. It places the focus on what matters most for these kinds of workloads:
Starting point is 00:08:08 throughput and request latency. In this test, we first warmed up both stores with the same payload sizes used during the previous test. Then we initiated reads against their respective ranges for 30 minutes. Memcached: Memcached achieved an impressive 3 million GETs per second, fully maximizing the AWS NIC bandwidth of 25 Gbps. Memcached kept a steady 3M requests per second, fully maximizing the NIC throughput. The parsed results showed that p99.999 responses completed below 1 millisecond:

stat: cmd_get  Total Ops: 5,503,000,496  Rate: 3,060,908/s
=== timer mg ===
1-10us          0              0.000%
10-99us         343,504,304    6.238%
100-999us       5,163,057,634  93.762%
1-2ms           11,500         0.0002%

Starting point is 00:09:06 ScyllaDB: to read more rows in ScyllaDB, we needed to devise a better data model for client requests due to protocol characteristics, in particular the lack of pipelining. With a clustering key, we could fully maximize ScyllaDB's cache, resulting in a significant improvement in the number of cached rows. We ingested 5M partitions, each with 16 clustering keys, for a total of 80M cached rows.
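A minimal sketch of that wide-partition data model, under our own naming (the keyspace, table and replication settings below are illustrative, not the project's exact schema):

# One partition key with 16 clustering rows per partition; a single query returns the whole partition
cqlsh scylla-host -e "
  CREATE KEYSPACE IF NOT EXISTS kv WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
  CREATE TABLE IF NOT EXISTS kv.wide (
      pkey  bigint,   -- 5M distinct partitions in the test
      ckey  int,      -- 16 clustering rows per partition
      value blob,
      PRIMARY KEY (pkey, ckey)
  );"
cqlsh scylla-host -e "SELECT ckey, value FROM kv.wide WHERE pkey = 42;"

As the article notes, CQL lacks the kind of request pipelining Memcached relies on, so fetching one 16-row partition stands in for the work that 16 pipelined GETs would do on the Memcached side.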
Starting point is 00:09:47 As a result, the number of records within the cache significantly improved compared to the key-value numbers shown previously. As dormando correctly pointed out (thanks!), this configuration is significantly different from the previous Memcached setup: while the Memcached workload always hits an individual key-value pair, a single request in ScyllaDB results in several rows being returned. Notably, the same results could be achieved with Memcached by feeding the entire payload as the value under a single key, with the results scaling accordingly. We explain the reasons for these changes in the detailed write-up. There, we cover characteristics of the CQL protocol, such as its per-item overhead compared to Memcached and its lack of support for pipelining, which make wide partitions more efficient on ScyllaDB than single-key
Starting point is 00:10:33 fetches. With these adjustments, our loaders ran a total of 187K read ops per second over 30 minutes, with each operation retrieving 16 rows. Similarly to Memcached, ScyllaDB also maximized the NIC throughput, serving roughly 3M rows per second solely from in-memory data. ScyllaDB exposes server-side latency information, which is useful for analyzing latency without the network. During the test, ScyllaDB's server-side p99 latency remained within 1 millisecond bounds. The client-side percentiles are, unsurprisingly, higher than the server-side latency, with a read p99 of 0.9 milliseconds. Takeaways: both Memcached and ScyllaDB fully saturated the network. To prevent saturating the maximum network packets
Starting point is 00:11:21 per second, Memcached relied on request pipelining, whereas ScyllaDB switched to a wide-column orientation. ScyllaDB's cache showed considerable gains when following a wide-column schema, able to store more items compared to the previous simple key-value orientation. On the protocol level, Memcached's protocol is simpler and more lightweight, whereas ScyllaDB's CQL provides richer features but can be heavier.
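In its crudest form, Memcached pipelining just means writing many requests over one connection before reading any responses. A throwaway illustration with the meta protocol's mg command over netcat, borrowing the keyN naming from the tests (real loaders such as mcshredder batch requests far more efficiently; the host and key count are placeholders):

# Send 16 meta-gets in one batch over a single connection, then read all the responses
{ for i in $(seq 0 15); do printf 'mg key%d v\r\n' "$i"; done; printf 'quit\r\n'; } | nc memcached-host 11211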
Starting point is 00:12:00 Adding disks to the picture. Measuring flash storage performance introduces its own set of challenges, which makes it almost impossible to fully characterize a given workload realistically. For the disk-related tests, we decided to measure the most pessimistic situation: compare both solutions serving data mostly from block storage, knowing that the likelihood of realistic workloads doing this is somewhere close to zero. In practice, users should expect numbers in between the previous optimistic cache workload and this pessimistic disk-bound workload. Memcached Extstore: the Extstore wiki page provides extensive detail into the solution's inner workings. At a high level, it allows Memcached to keep its hash table and keys in memory, but store values on external storage. During our tests, we populated Memcached with 1.25B items with a value size of 1KB and a key size of up to 14 bytes.
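For orientation, Extstore is enabled through -o options on the memcached command line. A sketch that reuses the launch flags shown earlier, assuming an NVMe mount (the file path, size and I/O thread count here are placeholders rather than the exact values used in the tests):

# Keys and the hash table stay in RAM; values spill to the flash-backed file once memory fills up
taskset -c 1-7,9-15 /usr/local/memcached/bin/memcached -v -m 114100 -c 4096 --lock-memory --threads 14 -u scylla \
  -o ext_path=/mnt/nvme/extstore:1500G,ext_threads=64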
Starting point is 00:12:47 With Extstore, we stored around 11x the number of items compared to the previous in-memory workload until evictions started to kick in, as shown in the right-hand panel in the image above. Even though 11x is already an impressive number, the total data stored on flash was only 1.25 TB out of the total 3.5 TB provided by the AWS instance. Read-only performance: for the actual performance tests, we stressed Extstore against item sizes of 1KB and 8KB. The table below summarizes the results:

Test type             Items/GET  Payload  IO threads  GET rate  p99 latency
perfrun_metaget_pipe  16         1KB      32          188K/s    4~5 ms
perfrun_metaget_pipe  16         1KB      64          261K/s    5~6 ms
perfrun_metaget       1          1KB      64          256K/s    1~2 ms
perfrun_metaget_pipe  16         8KB      16          92K/s     5~6 ms
perfrun_metaget       1          8KB      16          9xK/s     <1 ms
perfrun_metaget_pipe  16         8KB      32          110K/s    3~4 ms
perfrun_metaget       1          8KB      32          105K/s    <1 ms

Starting point is 00:13:53 ScyllaDB: we populated ScyllaDB with the same number of items as used for Memcached. Although ScyllaDB showed higher GET rates than Memcached, it did so under slightly higher tail latencies compared to Memcached's non-pipelined workloads. This is summarized below:

Test type  Items/GET  Payload  Server-side p99  Client-side p99
1KB read   1          1KB      2 ms             2.4 ms
8KB read   1          8KB      1.54 ms          1.9 ms

Starting point is 00:14:41 Takeaways: Extstore required considerable tuning of its settings in order to fully saturate flash storage I/O. Due to Memcached's architecture, smaller payloads are unable to fully utilize the available disk space, providing smaller gains compared to ScyllaDB. ScyllaDB's rates were overall higher than Memcached's in a key-value orientation, especially under higher payload sizes; its latencies were better than Memcached's pipelined requests, but slightly higher than individual GETs in Memcached. Overwrite workload: following our previous disk results, we then compared both solutions in a read-mostly workload targeting the same throughput of 250K ops/sec. The workload in question is a slight modification of Memcached's basic test for Extstore, with 10% random overwrites. It is considered a semi-worst case scenario. Memcached: Memcached achieved a rate of slightly under 249K ops/sec during the test.
Starting point is 00:15:43 Although the write rates remained steady for the duration of the test, we observed that reads fluctuated slightly throughout the run. We also observed slightly high extstore_io_queue metrics despite the lowered read ratios, but latencies still remained low. These results are summarized below:

Operation  IO threads  Rate     p99 latency
cmd_get    64          224K/s   1~2 ms
cmd_set    64          24.8K/s  <1 ms

ScyllaDB: the test was run using two loaders, each targeting half of the total rate. Even though ScyllaDB achieved a slightly higher throughput (249.5K ops/sec),
Starting point is 00:16:27 the write latencies were kept low throughout the run and the read latencies were higher. Similarly to Memcached, the table below summarizes the client-side results across the two loaders:

Loader    Rate      Write p99  Read p99
loader 1  124.9K/s  1.4 ms     2.6 ms
loader 2  124.6K/s  1.x ms     2.6 ms

Takeaways: both Memcached and ScyllaDB write rates were steady, with reads fluctuating slightly throughout the run. ScyllaDB writes still account for the commitlog overhead, which sits in the hot write path. ScyllaDB's server-side latencies were similar to those observed in the Memcached results, although its client-side latencies were slightly higher.
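A rough sketch of how a rate-limited 90/10 mix like this can be expressed per loader with cassandra-stress. The read/write ratio and the 125K ops/sec per-loader target follow the description above; the duration, population, node address and thread count are placeholders, and older cassandra-stress builds may spell the rate-limiting option differently:

# 90% reads / 10% overwrites, throttled to half of the 250K ops/sec target on each loader
cassandra-stress mixed 'ratio(read=9,write=1)' duration=30m cl=ONE \
  -col size='FIXED(1000)' n='FIXED(1)' \
  -mode native cql3 -pop 'dist=UNIFORM(1..1250000000)' \
  -rate threads=512 throttle=125000/s -node 10.0.0.1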
Starting point is 00:17:12 Read a more detailed analysis in the GitBook for this project. Wrapping up: both Memcached and ScyllaDB managed to maximize the underlying hardware utilization across all tests and keep latencies predictably low. So which one should you pick? The real answer: it depends. If your existing workload can accommodate a simple key-value model and it benefits from pipelining, then Memcached should be more suitable for your needs. On the other hand, if the workload requires support for complex data models, then ScyllaDB is likely a better fit. Another reason for sticking with Memcached: it easily delivers traffic far beyond what a NIC can sustain. In fact, in this Hacker News thread, dormando mentioned that he could scale it up past 55 million read ops/sec on a considerably larger server.
Starting point is 00:17:58 Given that, you could make use of smaller and/or cheaper instance types to sustain a similar workload, provided the available memory and disk footprint suffice for your workload's needs. A different angle to consider is the data set size. Even though Extstore provides great cost savings by allowing you to store items beyond RAM, there's a limit to how many keys can fit per gigabyte of memory. Workloads with very small items should see smaller gains compared to those with larger items. That's not the case with ScyllaDB, which allows you to store billions of items irrespective of their sizes. It's also important to consider whether data persistence is required. If it is, then running ScyllaDB as a replicated distributed cache provides you greater resilience and nonstop operations,
Starting point is 00:18:41 with the trade-off being, as Memcached correctly states, that replication halves your effective cache size. Unfortunately, Extstore doesn't support warm restarts, and thus the failure or maintenance of a single node is prone to elevating your cache miss ratios. Whether this is acceptable or not depends on your application semantics: if a cache miss corresponds to a round trip to the database, then the end-to-end latency will be momentarily higher. With regard to consistent hashing, Memcached clients are responsible for distributing keys across your distributed servers. This may introduce some hiccups, as different client configurations will cause keys to be assigned differently, and some implementations may not be compatible with each other.
Starting point is 00:19:24 These details are outlined in Memcached's ConfiguringClient wiki. ScyllaDB takes a different approach: consistent hashing is done at the server level and propagated to clients when the connection is first established. This ensures that all connected clients always observe the same topology as you scale.
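One way to see that the ring lives on the server side: any client, cqlsh included, can read the same token ownership from the system tables, so every driver derives an identical view of the topology. The host name below is a placeholder:

# Token ranges owned by the other nodes, as propagated to every connected client
cqlsh scylla-host -e "SELECT peer, data_center, tokens FROM system.peers;"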
Starting point is 00:19:57 So who won, or who lost? Well, this does not have to be a competition, nor an exhaustive list outlining every single consideration for each solution. Both ScyllaDB and Memcached used different approaches to efficiently utilize the underlying infrastructure, and when configured correctly, both of them have the potential to provide great cost savings. We were pleased to see ScyllaDB matching the numbers of the industry-recognized Memcached. Of course, we had no expectations of our database being faster; in fact, as we approach microsecond latencies at scale, the definition of faster becomes quite subjective. 🙂 Thank you for listening to this HackerNoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn and publish.
