The Good Tech Companies - A Tale from Database Performance at Scale
Episode Date: August 7, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/a-tale-from-database-performance-at-scale. A humorous yet insightful tale of database pitfalls, from costly overprovisioning to data loss, spikes, backups, and key scalability lessons. Check more stories related to cloud at: https://hackernoon.com/c/cloud. You can also check exclusive content about #database-performance-at-scale, #scalability, #backups, #distributed-systems, #scylladb, #workload-analysis, #nosql, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. Patrick's green fedora shop becomes a crash course in database performance. From overpaying for unused capacity to losing data, mishandling traffic spikes, and poorly timed backups, each blunder reveals vital lessons on workload analysis, scaling, observability, consistency, and maintenance scheduling in distributed systems.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
A Tale from Database Performance at Scale, by ScyllaDB.
Database performance is serious business, but why not have a little fun exploring its challenges
and complexities? 😉 Here's a rather fanciful story we presented in Chapter 1 of
Database Performance at Scale, a free open access book. The technical topics covered here
are expanded on throughout the book, but this is the one and only time we talk about poor Patrick.
May his struggles bring you some valuable lessons, solace in your own database performance predicaments,
and maybe a few chuckles as well.
After losing his job at a FAANG (MAANG? MANGA?) company, Patrick decided to strike off on his own and founded a niche online store
dedicated to trading his absolute favorite among headwear: green fedoras.
Noticing that a certain NoSQL database was recently trending on the front page of Hacker News,
Patrick picked it for his backend stack.
After some experimentation with the offering's free tier, Patrick decided to sign a one-year contract with a major cloud provider to get a significant discount on its NoSQL database as a service offering.
With provisioned throughput capable of serving up to 1,000 customers every second, the technology stack was ready and the store opened its virtual doors to the customers.
To Patrick's disappointment, fewer than 10 customers visited the site daily.
At the same time, the shiny new database cluster kept running, fueled by a steady influx
of money from his credit card and waiting for its potential to be harnessed. Patrick's Diary of Lessons
Learned, Part 1. The lessons started right away. Although some databases advertise themselves
as universal, most of them perform best for certain kinds of workloads. The analysis before
selecting a database for your own needs must include estimating the characteristics of your
own workload. Is it likely to be a predictable, steady flow of requests, e.g., updates being
fetched from other systems periodically? Or is the variance high and hard to predict, with the
system being idle for potentially long periods of time, with occasional bumps of activity?
Database as a service offerings often let you pick between provisioned throughput and on-demand
purchasing. Although the former is more cost-efficient, it incurs a certain cost regardless of how
busy the database actually is. The latter costs more per request, but you only pay for what you
use. Give yourself time to evaluate your choice and avoid committing to long-term contracts,
even if lured by a discount, before you see that the setup works for you in a sustainable way.
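To make the provisioned-versus-on-demand lesson concrete, here is a back-of-the-envelope sketch. All prices and the `monthly_cost` helper are hypothetical illustrations, not any provider's actual rates:

```python
# Break-even sketch: provisioned vs. on-demand pricing.
# Both rates below are made-up placeholders -- check your provider's real pricing.

PROVISIONED_COST_PER_HOUR = 0.65   # flat fee for 1,000 req/s of capacity (hypothetical)
ON_DEMAND_COST_PER_MILLION = 1.25  # pay-per-request rate (hypothetical)

def monthly_cost(avg_requests_per_second: float) -> dict:
    """Compare the two pricing models for a given average request rate."""
    hours = 730  # roughly the number of hours in a month
    provisioned = PROVISIONED_COST_PER_HOUR * hours
    total_requests = avg_requests_per_second * 3600 * hours
    on_demand = total_requests / 1_000_000 * ON_DEMAND_COST_PER_MILLION
    return {"provisioned": round(provisioned, 2), "on_demand": round(on_demand, 2)}

# Patrick's actual traffic, fewer than 10 visitors a day, is a tiny fraction
# of a request per second -- on-demand would have cost him pennies.
print(monthly_cost(0.0001))
# At a sustained heavy load, the flat provisioned fee wins instead.
print(monthly_cost(500))
```

The crossover point depends entirely on how steady the workload is, which is exactly why estimating your workload's shape has to come before signing a contract.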
The first spike. March 17th seemed like an extremely lucky day. Patrick was pleased to notice
a lot of new orders starting from the early morning. But as the number of active customers
skyrocketed around noon, Patrick's mood started to deteriorate. This was strictly correlated with
the rate of calls he received from angry customers reporting their
inability to proceed with their orders. After a short brainstorming session with himself and a web
search engine, Patrick realized, to his dismay, that he lacked any observability tools on his
precious and quite expensive database cluster. Shortly after frantically setting up Grafana
and browsing the metrics, Patrick saw that although the number of incoming requests kept
growing, their success rate was capped at a certain level, way below today's expected traffic.
Provisioned throughput strikes again, Patrick groaned to himself, while scrolling through
thousands of throughput exceeded error messages that started appearing around 11 a.m.
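Errors like these are typically surfaced to the client, which is expected to retry with backoff rather than hammer the database even harder. A minimal sketch; `ThroughputExceeded` here is a stand-in for whatever error class your provider's driver actually raises:

```python
import random
import time

class ThroughputExceeded(Exception):
    """Stand-in for the provider's 'throughput exceeded' error class."""

def with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Retry an operation with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of attempts: propagate the error to the caller
            # Sleep 0.1s, 0.2s, 0.4s, ... plus jitter so that many clients
            # don't all retry at the same instant (thundering herd).
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Backoff smooths out a brief overload, but as Patrick is about to learn, it cannot conjure capacity that was never provisioned.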
Patrick's Diary of Lessons Learned, Part 2. This is what Patrick learned. If your workload is susceptible
to spikes, be prepared for them and try to architect your cluster to be able to survive a temporarily
elevated load. Database as a service solutions tend to allow configuring the provisioned
throughput in a dynamic way, which means that the threshold of accepted requests
can occasionally be raised temporarily to a previously configured level. Or,
respectively, they allow it to be temporarily decreased to make the solution slightly more cost-efficient.
Always expect spikes.
Even if your workload is absolutely steady, a temporary hardware failure or a surprise
DDoS attack can cause a sharp increase in incoming requests.
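One application-side guardrail worth noting here (the diary returns to it in a later lesson) is capping how many requests can be in flight against the database at once, so a spike queues up in front of the database instead of flooding it. A minimal sketch for a threaded service; the limit of 64 is an arbitrary placeholder to tune for your own cluster:

```python
import threading

# Cap in-flight database requests at the application layer. When the cap is
# reached, additional callers block here instead of piling onto the database.
DB_CONCURRENCY_LIMIT = threading.Semaphore(64)  # illustrative value

def query_database(run_query, *args):
    """Run a database query under the global concurrency cap."""
    with DB_CONCURRENCY_LIMIT:
        return run_query(*args)
```

A semaphore is the simplest possible version of this idea; production services often layer queue length limits and load shedding on top of it.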
Observability is key in distributed systems.
It allows the developers to retrospectively investigate a failure.
It also provides real-time alerts when a likely failure scenario is detected, allowing
people to react quickly and either prevent a larger failure from happening, or at least minimize
the negative impact on the cluster. The first loss. Patrick didn't even manage to recover
from the trauma of losing most of his potential income on the only day throughout the year
during which green fedoras experienced any kind of demand when the letter came. It included
an angry rant from a would-be customer who had successfully proceeded with his order and paid for
it, with a receipt from the payment processing operator as proof, but was now unable to see any
details of his order and was still waiting for the delivery. Without further ado, Patrick
browsed the database. To his astonishment, he didn't find any trace of the order either.
For completeness, Patrick also put his wishful thinking into practice by browsing the backup snapshot
directory. It remained empty, as one of Patrick's initial executive decisions was to save
time and money by not scheduling any periodic backup procedures. How did data loss happen to him,
of all people? After studying the consistency model of his database of choice, Patrick realized that there's
a compromise to be made between consistency guarantees, performance, and availability. By configuring the queries,
one can either demand linearizability at the cost of decreased throughput, or reduce the
consistency guarantees and increase performance accordingly. Higher throughput capabilities were a
no-brainer for Patrick a few days ago, but ultimately customer data landed on a single server without
any replicas distributed in the system. Once this server failed, which happens to hardware surprisingly
often, especially at large scale, the data was gone. Patrick's Diary of Lessons Learned, Part 3. Further lessons
include backups are vital in a distributed environment, and there's no such thing as setting
backup routines too soon. Systems fail, and backups are there to restore as much of the
important data as possible. Every database system has a certain consistency model, and it's crucial
to take that into account when designing your project. There might be compromises to make.
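In Dynamo-family systems such as the one Patrick chose, that compromise is tunable per query via read and write consistency levels. A toy sketch of the underlying quorum-overlap rule (the function is illustrative, not any driver's API):

```python
def is_strongly_consistent(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """A read is guaranteed to observe the latest write when the read set
    and the write set must overlap, i.e. R + W > RF (replication factor)."""
    return read_replicas + write_replicas > rf

# RF=3 with QUORUM (2-of-3) reads and writes: the sets overlap,
# so reads always see the latest acknowledged write.
assert is_strongly_consistent(3, 2, 2)
# RF=3 with ONE/ONE: fastest option, but a read may land on a stale replica.
assert not is_strongly_consistent(3, 1, 1)
# Note: RF=1 trivially satisfies the formula, yet offers zero fault
# tolerance -- lose that single server and the data is gone, as Patrick did.
```

The rule explains why quorum settings are the usual middle ground: they keep the overlap guarantee while tolerating the loss of a replica.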
In some use cases, think financial systems, consistency is the key. In other ones, eventual consistency is
acceptable, as long as it keeps the system highly available and responsive. The spike strikes again.
Months went by, and Patrick's sleeping schedule was even beginning to show signs of stabilization.
With regular backups, a redesigned consistency model, and a reminder set in his calendar for March 16th
to scale up the cluster to manage elevated traffic, he felt moderately safe. If only he knew that
a 10-second video of a cat dressed as a leprechaun had just gone viral in Malaysia, which,
taking time zones into account, happened around 2 a.m. Patrick's time, ruining the aforementioned
sleep stabilization efforts. On the one hand, the observability suite did its job and set off a
warning early, allowing for a rapid response. On the other hand, even though Patrick reacted on time,
databases are seldom able to scale instantaneously, and his system of choice was no exception
in that regard. The spike in concurrency was very high and concentrated, as thousands of Malaysian
teenagers rushed to buy green hats in pursuit of ever-changing internet trends.
Patrick was able to observe a real-life instantiation of Little's law, which he vaguely remembered
from his days at the university. With a beautifully concise formula, L = λW,
the law can be simplified to the fact that concurrency equals throughput times latency.
Tip: for those having trouble with remembering the formula, think units.
Concurrency is just a number; latency can be measured in seconds, while
throughput is usually expressed in 1/second. Then, it stands to reason that in
order for units to match, concurrency should be obtained by multiplying latency
(seconds) by throughput (1/second). You're welcome. Throughput
depends on the hardware and naturally has its limits, e.g., you can't expect an NVMe
drive purchased in 2023 to serve data for you in terabytes per second, although we are crossing
our fingers for this assumption to be invalidated in the near future. Once the limit is hit,
you can treat it as a constant in the formula. It's then clear that as concurrency rises,
so does latency. For the end users, Malaysian teenagers in this scenario, it means that the
latency is eventually going to cross the magic barrier for the average human perception of a few
seconds. Once that happens, users get too frustrated and simply give up on trying altogether,
assuming that the system is broken beyond repair. It's easy to find online articles quoting
that Amazon found that 100 ms of latency cost them 1% in sales. Although it sounds overly simplified,
it is also true enough. Patrick's Diary of Lessons Learned, Part 4. The lessons continue. Unexpected
spikes are inevitable, and scaling out the cluster might not be swift enough to mitigate the
negative effects of excessive concurrency. Expecting the database to handle it properly is not
without merit, but not every database is capable of that. If possible, limit the concurrency
in your system as early as possible. For instance, if the database is never touched directly
by customers, which is a very good idea for multiple reasons, but instead is accessed through
a set of microservices under your control, make sure that the microservices are also aware of
the concurrency limits and adhere to them. Keep in mind that Little's law exists; it's fundamental
knowledge for anyone interested in distributed systems. Quoting it often also makes you appear
exceptionally smart among peers. Backup strikes back. After redesigning his project yet again to take
expected and unexpected concurrency fluctuations into account, Patrick happily waited for his fedora
business to finally become ramen profitable. Unfortunately, the next March 17th didn't go as
smoothly as expected either. Patrick spent most of the day enjoying steady Grafana dashboards,
which kept assuring him that the traffic was under control and capable of handling the load
of customers with a healthy, safe margin. But then the dashboards changed their tune, kindly mentioning that
the disks had become severely overutilized. This seemed completely out of place given the observed
concurrency. While looking for the possible source of this anomaly, Patrick noticed, to his horror,
that the scheduled backup procedure coincided with the annual peak load. Patrick's Diary of Lessons
Learned, Part 5, concluding thoughts. Database systems are hardly ever idle, even without incoming
user requests. Maintenance operations often happen and you must take them into consideration
because they're an internal source of concurrency and resource consumption. Whenever possible,
schedule maintenance operations for times with expected low pressure on the system. If your database
management system supports any kind of quality of service configuration, it's a good idea to investigate
such capabilities. For instance, it might be possible to set a strong priority for user requests
over regular maintenance operations, especially during peak hours. Respectively, periods with
low user-induced activity can be utilized to speed up background activities. In the database
world, systems that use a variant of LSM trees for underlying storage need to perform quite a bit of
compactions, a kind of maintenance operation on data, in order to keep the read/write performance
predictable and steady. The end. About Piotr Sarna. Piotr is a software engineer who is keen on
open source projects and the Rust and C++ languages. He previously developed an open source distributed
file system and had a brief adventure with the Linux kernel during an apprenticeship at Samsung
Electronics. He's also a longtime contributor and maintainer of ScyllaDB as well as libSQL. Piotr
graduated from the University of Warsaw with an MSc in computer science. He is a co-author of the books
Database Performance at Scale and Writing for Developers: Blogs That Get Read. Thank you for listening
to this Hackernoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn, and publish.
