The Good Tech Companies - Building a Petabyte-Scale Web Archive
Episode Date: December 9, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/building-a-petabyte-scale-web-archive. How we cut AWS costs after a $100,000 data retrieval mistake by optimizing our Web Archive. Check more stories related to tech-stories at: https://hackernoon.com/c/tech-stories. You can also check exclusive content about #web-archive-architecture, #aws, #web-data, #aws-glacier-costs, #etl-pipeline-optimization, #cost-efficient-data-pipelines, #bright-data-web-archive, #good-company, and more. This story was written by: @brightdata. Learn more about this writer by checking @brightdata's about page, and for more stories, please visit hackernoon.com. Discover how Bright Data optimized its Web Archive to handle petabytes of data in AWS. Learn how a $100,000 billing mistake revealed the trade-off between write speed, read speed, and cloud costs, and how we fixed it with a cost-effective Rearrange Pipeline. Spoiler: we are hiring!
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Building a Petabyte-Scale Web Archive, by Bright Data. In an engineer's ideal world, architecture
is always beautiful. In the real world of high-scale systems, you have to make compromises.
One of the fundamental problems an engineer must think about at the start is the vicious
trade-off between write speed and read speed. Usually, you sacrifice one for the other,
but in our case, working with petabytes of data in AWS, this compromise didn't hit our speed,
it hit the wallet. We built a system that wrote data perfectly, but every time it read from the archive,
it burned through the budget in the most painful way imaginable. After all,
reading petabytes from AWS costs money for data transfer, request counts, and storage
class retrievals. A lot of money. This is the story of how we optimized it to make it more
efficient and cost-effective. Part 0. How we ended up spending $100,000 in AWS fees.
True story. A few months back, one of our solution architects wanted to pull a sample export
from a rare, low-traffic website to demonstrate the product to a potential client.
Due to a bug in the API, the safety limit on file count wasn't applied. Because the data for
this rare site was scattered across millions of archives alongside high traffic sites,
the system tried to restore nearly half of our entire historical storage to find those few pages.
That honest mistake ended up costing us nearly $100,000 in AWS fees. Now, I fixed the API bug immediately
and added strict limits, but the architectural vulnerability remained. It was a ticking time bomb.
Let me tell you the story of the Bright Data Web Archive architecture: how I drove the system
into the trap of cheap storage, and how I climbed out using a Rearrange Pipeline.
Part 1. The Write-First Legacy. When I started working on the Web Archive, the system was already ingesting a massive data stream: millions of requests per minute, tens of terabytes per day. The foundational architecture was built with a primary goal: capture everything without data loss. It relied on the most durable strategy for high-throughput systems, the append-only log.
1. Data (HTML, JSON) is buffered.
2. Once the buffer hits approximately 300 megabytes, it is "sealed" into a TAR archive.
3. The archive flies off to S3.
4. After three days, files move to S3 Glacier Deep Archive.
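Roughly, that write path looks like the minimal sketch below. The bucket name, prefixes, thresholds, and function names are illustrative assumptions rather than our production code, and the TAR packing is abbreviated to a plain concatenation.

```typescript
// Minimal sketch of the append-only ingestion path (hypothetical names), using AWS SDK v3:
// buffer incoming records, seal at ~300 MB, upload to S3, and let a lifecycle rule
// move sealed objects to Glacier Deep Archive after three days.
import {
  S3Client,
  PutObjectCommand,
  PutBucketLifecycleConfigurationCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "web-archive-raw";          // assumed bucket name
const SEAL_THRESHOLD = 300 * 1024 * 1024;  // ~300 MB per sealed archive

let chunks: Buffer[] = [];
let buffered = 0;

// Append one captured page (HTML/JSON) to the in-memory buffer.
export async function ingest(record: Buffer): Promise<void> {
  chunks.push(record);
  buffered += record.length;
  if (buffered >= SEAL_THRESHOLD) await seal();
}

// "Seal" the buffer into a single object and ship it to S3.
// The real pipeline packs a proper TAR; here the entries are simply concatenated.
async function seal(): Promise<void> {
  const body = Buffer.concat(chunks);
  const key = `raw/${new Date().toISOString()}.tar`; // time-based key: great write locality, no read locality
  chunks = [];
  buffered = 0;
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: key, Body: body }));
}

// One-time setup: transition sealed archives to Deep Archive after 3 days.
export async function configureLifecycle(): Promise<void> {
  await s3.send(new PutBucketLifecycleConfigurationCommand({
    Bucket: BUCKET,
    LifecycleConfiguration: {
      Rules: [{
        ID: "raw-to-deep-archive",
        Status: "Enabled",
        Filter: { Prefix: "raw/" },
        Transitions: [{ Days: 3, StorageClass: "DEEP_ARCHIVE" }],
      }],
    },
  }));
}
```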
For the ingestion phase, this design was flawless. Storing data in Deep Archive costs pennies,
and the write throughput is virtually unlimited. The problem? That pricing nuance. The architecture
worked perfectly for writing, until clients came asking for historical data. That's when I faced a fundamental
contradiction. The system writes by time: an archive sealed at 12 p.m. contains a mix of many different domains. The system reads
by domain: the client asks, "Give me all pages from this domain for the last year."
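To see why that combination hurts, here is a tiny, purely illustrative sketch with hypothetical manifests and made-up domain names: with time-sealed archives, answering a per-domain query means touching almost every archive ever written.

```typescript
// Purely illustrative: time-sealed archives mix domains together, so a
// per-domain export has to touch nearly every archive in the store.
interface ArchiveManifest {
  key: string;        // time-based key of a sealed TAR
  domains: string[];  // domains whose pages landed inside it
}

const manifests: ArchiveManifest[] = [
  { key: "raw/2024-06-01T12:00Z.tar", domains: ["siteA.com", "siteB.com", "rare-site.com"] },
  { key: "raw/2024-06-01T12:05Z.tar", domains: ["siteC.com", "rare-site.com"] },
  { key: "raw/2024-06-01T12:10Z.tar", domains: ["siteA.com", "siteD.com", "rare-site.com"] },
  // ...millions more, almost all of them holding a few pages of rare-site.com
];

// Which archives must be restored to serve "all pages of this domain"?
function archivesToRestore(domain: string): string[] {
  return manifests
    .filter((m) => m.domains.includes(domain))
    .map((m) => m.key);
}

console.log(archivesToRestore("rare-site.com")); // effectively the whole archive
```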
Here lies the mistake that inspired this article. Like many engineers, I'm used to thinking about latency, IOPS, and
throughput. But I overlooked the AWS Glacier billing model. I thought, well, retrieving a few thousand
archives is slow, up to 48 hours, but it's not that expensive. The reality: AWS charges not just for the
API call, but for the volume of data restored, in dollars per gigabyte retrieved.
The Golden Byte Effect. Imagine a client requests 1,000 pages from a single domain. Because the writing logic was
chronological, these pages can be spread across 1,000 different TAR archives. To give the client these 50
megabytes of useful data, a disaster occurs:
1. The system has to trigger a restore for 1,000 archives.
2. It lifts 300 gigabytes of data out of the "freezer" (1,000 archives times 300 megabytes).
3. AWS bills us for restoring 300 gigabytes.
4. I extract the 50 megabytes required and throw away the other 299.95 gigabytes.
We were paying to restore terabytes of trash just to extract
grains of gold. It was a classic data locality problem that turned into a financial black hole.
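Here is the same arithmetic as a small sketch. The per-gigabyte retrieval rate is a placeholder, since the real figure depends on the storage class, retrieval tier, and region.

```typescript
// The Golden Byte effect in numbers. RETRIEVAL_USD_PER_GB is an assumed,
// illustrative rate, not an actual AWS price.
const ARCHIVE_SIZE_MB = 300;       // each sealed TAR is ~300 MB
const ARCHIVES_TOUCHED = 1_000;    // the 1,000 requested pages live in 1,000 different TARs
const USEFUL_DATA_MB = 50;         // what the client actually needs
const RETRIEVAL_USD_PER_GB = 0.02; // placeholder retrieval rate

const restoredGB = (ARCHIVE_SIZE_MB * ARCHIVES_TOUCHED) / 1000; // 300 GB lifted out of the freezer
const usefulGB = USEFUL_DATA_MB / 1000;                         // 0.05 GB actually kept
const wastedGB = restoredGB - usefulGB;                         // 299.95 GB restored for nothing
const retrievalCost = restoredGB * RETRIEVAL_USD_PER_GB;        // billed on every restored gigabyte

console.log({ restoredGB, usefulGB, wastedGB, retrievalCost });
```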
Part 2. Fixing the Mistake: The Rearrange Pipeline. I couldn't quickly change the ingestion method: the incoming stream is too parallel and massive to sort on the fly (though I am working on that), and I needed a solution that worked for already-archived data, too. So, I designed the Rearrange Pipeline, a background process that "defragments" the archive. It is an asynchronous ETL (extract, transform, load) process with several critical core components.
1. Selection. It makes no sense to sort data that clients aren't asking for. Thus, I direct all new data into the pipeline, as well as data that clients have specifically asked to restore. We overpay for the retrieval the first time, but it never happens a second time.
2. Shuffling and grouping. Multiple workers download unsorted files in parallel and organize buffers by domain (see the sketch after this list). Since the system is asynchronous, I don't worry about the incoming stream overloading memory; the workers handle the load at their own pace.
3. Rewriting. I write the sorted files back to S3 under a new prefix to distinguish sorted files from raw ones.
4. Metadata swap. In Snowflake, the metadata table is append-only, and updating rows in place is prohibitively expensive. The solution I found was far more efficient: take all records for a specific day, write them to a separate table using a join, delete the original day's records, and insert the entire day back with the modified records. I managed to process 300+ days and 160+ billion update operations in just a few hours on a 4X-Large Snowflake warehouse.
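To make the pipeline more tangible, here is a condensed sketch of the shuffle/rewrite worker and the daily metadata swap. Everything in it is illustrative: the bucket, prefixes, table names, and columns are hypothetical, JSON stands in for the real TAR packing, and restore waits, retries, and the Snowflake driver wiring are left out.

```typescript
// Condensed sketch of the Rearrange Pipeline core (hypothetical names throughout).
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "web-archive"; // assumed bucket

interface Entry {
  domain: string;
  url: string;
  html: string;
}

// Steps 2 and 3: download unsorted archives, group their entries by domain,
// and write one object per domain back under a "sorted/" prefix.
export async function rearrange(rawKeys: string[]): Promise<void> {
  const byDomain = new Map<string, Entry[]>();

  for (const key of rawKeys) {
    const obj = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: key }));
    // The real pipeline unpacks a TAR; a JSON array of entries stands in for it here.
    const entries: Entry[] = JSON.parse(await obj.Body!.transformToString());
    for (const e of entries) {
      const group = byDomain.get(e.domain) ?? [];
      group.push(e);
      byDomain.set(e.domain, group);
    }
  }

  for (const [domain, group] of byDomain) {
    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: `sorted/${domain}/${Date.now()}.json`, // domain-first key: read locality
      Body: JSON.stringify(group),
    }));
  }
}

// Step 4: the Snowflake metadata swap for one day, as plain SQL statements.
// Instead of updating rows in place, rebuild the day's records in a scratch
// table via a join, delete the original day, and insert the whole day back.
export const metadataSwapSql = (day: string): string[] => [
  `CREATE OR REPLACE TABLE pages_meta_rebuild AS
     SELECT m.page_id, m.url, m.capture_day,
            COALESCE(r.new_archive_key, m.archive_key) AS archive_key
     FROM pages_meta m
     LEFT JOIN rearranged_locations r ON r.page_id = m.page_id
     WHERE m.capture_day = '${day}'`,
  `DELETE FROM pages_meta WHERE capture_day = '${day}'`,
  `INSERT INTO pages_meta SELECT * FROM pages_meta_rebuild`,
];
```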
The result: this change radically altered the product's economics. Pinpoint accuracy: now, when a client asks for a domain, the system restores only the data where that domain lives. Efficiency: depending on the granularity of the request (entire domain versus specific URLs via regex), I achieved a 10% to 80% reduction in garbage data retrieval, which is directly proportional to the cost. New capabilities: beyond just saving money on dumps, this unlocked entirely new business use cases. Because retrieving historical data is no longer agonizingly expensive, we can now afford to extract massive data sets for training AI models, conducting long-term market research, and building knowledge bases for agentic AI systems to reason over (think specialized search engines). What was previously a financial suicide mission is now a standard operation.
We are hiring. Bright Data is scaling the Web Archive even further. If you enjoy high-throughput distributed systems, data engineering at massive scale, building reliable pipelines under real-world load, pushing Node.js to its absolute limits, and solving problems that don't appear in textbooks, then I'd love to talk. We're hiring strong Node.js engineers to help build the next generation of the Web Archive. Having data
engineering and ETL experience is highly advantageous. Feel free to send your CV to Vodomar at
Bright Data.com. More updates coming as I continue scaling the archive, and as I keep finding new
and creative ways to break it. Thank you for listening to this Hackernoon story, read by
artificial intelligence. Visit hackernoon.com to read, write, learn and publish.
