The Good Tech Companies - A Tale from Database Performance at Scale

Episode Date: August 7, 2025

This story was originally published on HackerNoon at: https://hackernoon.com/a-tale-from-database-performance-at-scale. A humorous yet insightful tale of database pitfalls, from costly overprovisioning to data loss, spikes, backups, and key scalability lessons. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #database-performance-at-scale, #scalability, #backups, #distributed-systems, #scylladb, #workload-analysis, #nosql, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. Patrick’s green fedora shop becomes a crash course in database performance. From overpaying for unused capacity to losing data, mishandling traffic spikes, and poorly timed backups, each blunder reveals vital lessons on workload analysis, scaling, observability, consistency, and maintenance scheduling in distributed systems.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. A Tale from Database Performance at Scale, by ScyllaDB. Database performance is serious business, but why not have a little fun exploring its challenges and complexities? Winking face. Here's a rather fanciful story we presented in Chapter 1 of Database Performance at Scale, a free open access book. The technical topics covered here are expanded on throughout the book, but this is the one and only time we talk about poor Patrick. His struggles might bring you some valuable lessons, solace in your own database performance predicaments, and maybe a few chuckles as well.
Starting point is 00:00:37 After losing his job at a FAANG (MAANG? MANGA?) company, Patrick decided to strike off on his own and founded a niche online store dedicated to trading his absolute favorite among headwear: green fedoras. Noticing that a certain NoSQL database was recently trending on the front page of Hacker News, Patrick picked it for his backend stack. After some experimentation with the offering's free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database-as-a-service offering. With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready and the store opened its virtual doors to customers. To Patrick's disappointment, fewer than 10 customers visited the site daily.
Starting point is 00:01:24 At the same time, the shiny new database cluster kept running, fueled by a steady influx of money from his credit card and waiting for its potential to be harnessed. Patrick's Diary of Lessons Learned, Part 1. The lessons started right away. Although some databases advertise themselves as universal, most of them perform best for certain kinds of workloads. The analysis before selecting a database for your own needs must include estimating the characteristics of your own workload. Is it likely to be a predictable, steady flow of requests, e.g., updates being fetched from other systems periodically? Or is the variance high and hard to predict, with the system being idle for potentially long periods of time and occasional bumps of activity?
Starting point is 00:02:06 Database-as-a-service offerings often let you pick between provisioned throughput and on-demand purchasing. Although the former is more cost-efficient, it incurs a certain cost regardless of how busy the database actually is. The latter costs more per request, but you only pay for what you use. Give yourself time to evaluate your choice and avoid committing to long-term contracts, even if lured by a discount, before you see that the setup works for you in a sustainable way.
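That trade-off is easy to put into numbers. Here is a back-of-the-envelope sketch in Python; the prices are purely hypothetical and the traffic estimate is loosely modeled on Patrick's ten visitors a day, so none of the figures come from any real provider.
```python
# Back-of-the-envelope comparison of provisioned vs. on-demand pricing.
# All prices are hypothetical; plug in your provider's real numbers.

HOURS_PER_MONTH = 730


def provisioned_cost(capacity_rps: float, price_per_rps_hour: float) -> float:
    """Reserved capacity is billed whether you use it or not."""
    return capacity_rps * price_per_rps_hour * HOURS_PER_MONTH


def on_demand_cost(requests_per_month: float, price_per_million: float) -> float:
    """Pay-as-you-go: only the requests you actually make are billed."""
    return requests_per_month / 1_000_000 * price_per_million


# Patrick's situation: capacity for 1,000 requests/s, but ~10 visitors a day.
reserved = provisioned_cost(capacity_rps=1_000, price_per_rps_hour=0.00013)
actual_requests = 10 * 30 * 50  # ~10 visitors/day, ~50 requests each, per month
pay_as_you_go = on_demand_cost(actual_requests, price_per_million=1.25)

print(f"provisioned: ${reserved:,.2f}/month")    # paid even while the store is idle
print(f"on-demand:   ${pay_as_you_go:,.2f}/month")
```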
Starting point is 00:02:48 The first spike. March 17th seemed like an extremely lucky day. Patrick was pleased to notice lots of new orders starting from the early morning. But as the number of active customers skyrocketed around noon, Patrick's mood started to deteriorate. This was strictly correlated with the rate of calls he received from angry customers reporting their inability to proceed with their orders. After a short brainstorming session with himself and a web search engine, Patrick realized, to his dismay, that he lacked any observability tools on his precious and quite expensive database cluster. Shortly after frantically setting up Grafana and browsing the metrics, Patrick saw that although the number of incoming requests kept growing, their success rate was capped at a certain level, way below today's expected traffic. Provisioned throughput strikes again, Patrick groaned to himself, while scrolling through thousands of "throughput exceeded" error messages that started appearing around 11 a.m.
Starting point is 00:03:27 Patrick's Diary of Lessons Learned, Part 2. This is what Patrick learned. If your workload is susceptible to spikes, be prepared for it and try to architect your cluster to be able to survive a temporarily elevated load. Database-as-a-service solutions tend to allow configuring the provisioned throughput in a dynamic way, which means that the threshold of accepted requests can occasionally be raised temporarily to a previously configured level. Or, respectively, they allow it to be temporarily decreased to make the solution slightly more cost-efficient. Always expect spikes. Even if your workload is absolutely steady, a temporary hardware failure or a surprise DDoS attack can cause a sharp increase in incoming requests.
Starting point is 00:04:04 Observability is key in distributed systems. It allows the developers to retrospectively investigate a failure. It also provides real-time alerts when a likely failure scenario is detected, allowing people to react quickly and either prevent a larger failure from happening or at least minimize the negative impact on the cluster.
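Circling back to the first of those lessons, one common client-side damage-control measure for a temporarily elevated load is to retry throttled requests with exponential backoff and jitter instead of hammering the database. A minimal sketch follows; the exception type is a made-up stand-in for whatever your driver raises when the provisioned throughput is exceeded.
```python
import random
import time


class ThroughputExceeded(Exception):
    """Stand-in for whatever error your driver raises when requests are throttled."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=2.0):
    """Retry a throttled request with exponential backoff and full jitter.

    `request` is any zero-argument callable that performs the database call.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Sleep a random amount up to an exponentially growing cap.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
```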
Starting point is 00:04:38 The first loss. Patrick didn't even manage to recover from the trauma of losing most of his potential income on the only day throughout the year during which green fedoras experienced any kind of demand when the letter came. It included an angry rant from a would-be customer who had successfully proceeded with his order and paid for it, with a receipt from the payment processing operator as proof, but was now unable to see any details of his order and was still waiting for the delivery. Without further ado, Patrick browsed the database. To his astonishment, he didn't find any trace of the order either. For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot directory. It remained empty, as one of Patrick's initial executive decisions was to save time and money by not scheduling any periodic backup procedures. How did data loss happen to him, of all people?
Starting point is 00:05:21 After studying the consistency model of his database of choice, Patrick realized that there's a compromise to be made between consistency guarantees, performance, and availability. By configuring the queries, one can either demand linearizability at the cost of decreased throughput, or relax the consistency guarantees and increase performance accordingly. Higher throughput capabilities were a no-brainer for Patrick a few days ago, but ultimately customer data landed on a single server without any replicas distributed in the system. Once this server failed, which happens to hardware surprisingly often, especially at large scale, the data was gone.
Starting point is 00:06:07 Patrick's Diary of Lessons Learned, Part 3. Further lessons include: backups are vital in a distributed environment, and there's no such thing as setting up backup routines too soon. Systems fail, and backups are there to restore as much of the important data as possible. Every database system has a certain consistency model, and it's crucial to take that into account when designing your project. There might be compromises to make. In some use cases, think financial systems, consistency is key. In other ones, eventual consistency is acceptable, as long as it keeps the system highly available and responsive.
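To make that trade-off concrete, here is a minimal sketch using the Python driver for a CQL-compatible database such as ScyllaDB or Cassandra (the story never names its database, so this is only an illustration). The contact point, keyspace, and table are made up, and other databases expose the same knobs under different names.
```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])  # hypothetical contact point
session = cluster.connect()

# Replication factor 3: every row lives on three nodes, so a single failed
# server no longer means lost data (unlike Patrick's single-copy setup).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS fedora_shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS fedora_shop.orders (
        order_id uuid PRIMARY KEY,
        customer text,
        amount   decimal)
""")

# Stronger guarantee, lower throughput: a majority of replicas must acknowledge.
write = SimpleStatement(
    "INSERT INTO fedora_shop.orders (order_id, customer, amount) "
    "VALUES (uuid(), 'patrick', 19.99)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write)

# Weaker guarantee, higher throughput: any single replica is enough to answer.
read = SimpleStatement(
    "SELECT * FROM fedora_shop.orders",
    consistency_level=ConsistencyLevel.ONE,
)
for row in session.execute(read):
    print(row.order_id, row.customer, row.amount)

cluster.shutdown()
```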
Starting point is 00:06:54 The spike strikes again. Months went by, and Patrick's sleeping schedule was even beginning to show signs of stabilization. With regular backups, a redesigned consistency model, and a reminder set in his calendar for March 16th to scale up the cluster to manage elevated traffic, he felt moderately safe. If only he knew that a 10-second video of a cat dressed as a leprechaun had just gone viral in Malaysia, which, taking time zones into account, happened around 2 a.m. Patrick's time, ruining the aforementioned sleep stabilization efforts. On the one hand, the observability suite did its job and set off a warning early, allowing for a rapid response. On the other hand, even though Patrick reacted on time, databases are seldom able to scale instantaneously, and his system of choice was no exception in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian teenagers rushed to bulk-buy green hats in pursuit of ever-changing internet trends.
Starting point is 00:07:37 Patrick was able to observe a real-life instantiation of Little's Law, which he vaguely remembered from his days at the university. With its beautifully concise formula, L = λW, the law can be simplified to the fact that concurrency equals throughput times latency. Tip: for those having trouble remembering the formula, think units. Concurrency is just a number, latency can be measured in seconds, while throughput is usually expressed in 1/second. Then it stands to reason that, in order for the units to match, concurrency should be obtained by multiplying latency (seconds) by throughput (1/second). You're welcome.
Starting point is 00:08:18 Throughput depends on the hardware and naturally has its limits, e.g., you can't expect an NVMe drive purchased in 2023 to serve the data for you in terabytes per second, although we are crossing our fingers for this assumption to be invalidated in the near future. Once the limit is hit, you can treat it as a constant in the formula. It's then clear that as concurrency rises, so does latency. For the end users, Malaysian teenagers in this scenario, it means that the latency is eventually going to cross the magic barrier for average human perception of a few seconds. Once that happens, users get too frustrated and simply give up on trying altogether, assuming that the system is broken beyond repair. It's easy to find online articles quoting that Amazon found that 100 ms of latency cost them 1% in sales. Although it sounds overly simplified, it is also true enough.
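A quick worked example of that formula, with entirely made-up capacity and latency figures, shows why a saturated system suddenly feels so slow.
```python
# Little's law: L = lambda * W, i.e. concurrency = throughput * latency.
# All figures below are made up for illustration.

max_throughput = 10_000  # requests/s the cluster tops out at
normal_latency = 0.005   # 5 ms per request while the system is far from saturation

# Light load: serving 2,000 requests/s at 5 ms keeps only 10 requests in flight.
light_concurrency = 2_000 * normal_latency
print(light_concurrency)  # 10.0

# The spike: 50,000 requests in flight at once. Throughput is pinned at its
# hardware limit, so latency is the term that has to grow.
in_flight = 50_000
latency_under_spike = in_flight / max_throughput
print(latency_under_spike)  # 5.0 seconds -- well past the point where users give up
```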
Starting point is 00:09:04 Patrick's Diary of Lessons Learned, Part 4. The lessons continue. Unexpected spikes are inevitable, and scaling out the cluster might not be swift enough to mitigate the negative effects of excessive concurrency. Expecting the database to handle it properly is not without merit, but not every database is capable of that. If possible, limit the concurrency in your system as early as possible. For instance, if the database is never touched directly by customers, which is a very good idea for multiple reasons, but instead is accessed through a set of microservices under your control, make sure that the microservices are also aware of the concurrency limits and adhere to them, as sketched below.
Starting point is 00:09:44 Keep in mind that Little's Law exists; it's fundamental knowledge for anyone interested in distributed systems. Quoting it often also makes you appear exceptionally smart among peers.
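A minimal sketch of what adhering to a concurrency limit inside a microservice can look like; the limit of 256, the helper, and the query are arbitrary examples, not something prescribed by the story.
```python
import asyncio

# Cap the number of in-flight database requests this service allows.
# 256 is arbitrary; derive the real limit from Little's law:
# in-flight requests = target throughput * acceptable latency.
DB_CONCURRENCY = asyncio.Semaphore(256)


async def execute_query(statement: str, *args):
    """Hypothetical placeholder for the actual driver call."""
    await asyncio.sleep(0.005)  # pretend the query takes 5 ms
    return "ok"


async def place_order(order_id: int):
    # Requests beyond the limit wait here, inside the service, instead of
    # piling onto the database and inflating everyone's latency.
    async with DB_CONCURRENCY:
        return await execute_query("INSERT INTO orders ...", order_id)


async def main():
    # Even if 10,000 orders arrive at once, at most 256 hit the database together.
    await asyncio.gather(*(place_order(i) for i in range(10_000)))


asyncio.run(main())
```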
Starting point is 00:10:28 Backup strikes back. After redesigning his project yet again to take expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora business to finally become ramen profitable. Unfortunately, the next March 17th didn't go as smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards, which kept assuring him that the traffic was under control and being handled with a healthy, safe margin. But then the dashboards stopped, kindly mentioning that the disks had become severely overutilized. This seemed completely out of place given the observed concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror, that the scheduled backup procedure coincided with the annual peak load.
Starting point is 00:11:12 Patrick's Diary of Lessons Learned, Part 5: concluding thoughts. Database systems are hardly ever idle, even without incoming user requests. Maintenance operations often happen, and you must take them into consideration because they're an internal source of concurrency and resource consumption. Whenever possible, schedule maintenance operations for times with expected low pressure on the system. If your database management system supports any kind of quality-of-service configuration, it's a good idea to investigate such capabilities. For instance, it might be possible to set a strong priority for user requests over regular maintenance operations, especially during peak hours. Respectively, periods with low user-induced activity can be utilized to speed up background activities. In the database world, systems that use a variant of LSM trees for underlying storage need to perform quite a bit of compaction, a kind of maintenance operation on data, in order to keep the read/write performance predictable and steady.
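One low-tech way to act on that lesson, sketched here with hypothetical helper functions and thresholds, is to have the backup job itself check the calendar and the current load before it starts.
```python
import datetime

PEAK_DAYS = {(3, 17)}        # Patrick's one known busy day of the year
BUSY_THRESHOLD_RPS = 500.0   # arbitrary "the system is under pressure" cutoff


def current_requests_per_second() -> float:
    """Hypothetical helper: in real life, query your metrics stack here."""
    return 42.0


def run_backup() -> None:
    """Hypothetical helper: in real life, invoke the database's snapshot tooling."""
    print("backup started")


def maybe_run_backup(now: datetime.datetime) -> bool:
    """Take the backup only outside known peaks and observed high load."""
    if (now.month, now.day) in PEAK_DAYS:
        return False  # never compete with the annual traffic spike
    if current_requests_per_second() > BUSY_THRESHOLD_RPS:
        return False  # defer; let the scheduler retry later
    run_backup()
    return True


maybe_run_backup(datetime.datetime.now())
```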
Starting point is 00:11:59 The end. About Piotr Sarna. Piotr is a software engineer who is keen on open source projects and the Rust and C++ languages. He previously developed an open source distributed file system and had a brief adventure with the Linux kernel during an apprenticeship at Samsung Electronics. He's also a longtime contributor and maintainer of ScyllaDB as well as libSQL. Piotr graduated from the University of Warsaw with an MSc in computer science. He is a co-author of the books Database Performance at Scale and Writing for Developers: Blogs That Get Read. Thank you for listening to this HackerNoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn, and publish.
