The Good Tech Companies - How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database
Episode Date: February 3, 2026. This story was originally published on HackerNoon at: https://hackernoon.com/how-sharechat-scaled-their-ml-feature-store-1000x-without-scaling-the-database. How ShareChat scaled its ML feature store 1000× using ScyllaDB, smarter data modeling, and caching, without scaling the database. This story was written by: @scylladb. ShareChat engineers rebuilt a failing ML feature store into a system capable of serving billions of features per second without scaling the database. By redesigning data models, optimizing tiling, improving cache locality, and tuning gRPC and GC behavior, they turned ScyllaDB into a low-latency backbone for real-time recommendations.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How ShareChat scaled their ML feature store 1000X without scaling the database, by ScyllaDB.
How ShareChat engineers rebuilt a low-latency ML feature store on ScyllaDB after an initial
scalability failure, and what they learned along the way. The demand for low-latency machine
learning feature stores is higher than ever, but actually implementing one at scale remains a challenge.
That became clear when ShareChat engineers Ivan Burmistrov and Andrei Manakov took the P99
CONF 23 stage to share how they built a low-latency ML feature store based on ScyllaDB.
This isn't a tidy case study where adopting a new product saves the day.
It's a lessons learned story, a look at the value of relentless performance optimization,
with some important engineering takeaways.
The original system implementation fell far short of the company's scalability requirements.
The ultimate goal was to support one billion features per second, but the system failed under a load of just one million.
With some smart problem-solving, the team pulled it off, though.
Let's look at how their engineers managed to pivot from the initial failure to meet their lofty performance goal without scaling the underlying database.
ShareChat: India's leading social media platform.
To understand the scope of the challenge, it's important to know a little about ShareChat,
the leading social media platform in India.
On the ShareChat app, users discover and consume content in more than 15 different languages,
including videos, images, songs and more.
ShareChat also hosts a TikTok-like short video platform, Moj, that encourages users to be creative
with trending tags and contests.
Between the two applications, they serve a rapidly growing user base that already has over
325 million monthly active users, and their AI-based content recommendation engine is essential for driving user retention and engagement.
Feature stores at ShareChat. This story focuses on the system behind ML feature stores for the
short-form video app Moj. It offers fully personalized feeds to around 20 million daily active
users and 100 million monthly active users. Feeds serve 8,000 requests per second, and there's an
average of 2,000 content candidates being ranked on each request (for example, to find the 10 best
items to recommend). Features are pretty much anything that can be extracted from the data.
Ivan Burmistrov, principal staff software engineer at ShareChat, explained: "We compute features
for different entities. Post is one entity, user is another, and so on. From the computation
perspective, they're quite similar. However, the important difference is in the number of features
we need to fetch for each type of entity. When a user requests a feed, we fetch user features for
that single user. However, to rank all the posts, we need to fetch features for each candidate
post being ranked, so the total load on the system generated by post features is much larger than
the one generated by user features. This difference plays an important role in our story."
Why the initial feature store architecture failed to scale. At first, the primary focus was on
building a real-time user feature store because, at that point, user features were the most
important. The team started to build the feature store with that goal in mind. But then priorities
changed and post features became the focus too. This shift happened because the team started building
an entirely new ranking system with two major differences versus its predecessor. First, near-real-time
post features were more important. Second, the number of posts to rank increased from hundreds to thousands.
Ivan explained: "When we went to test this new system, it failed miserably. At around 1 million features per second,
the system became unresponsive, latencies went through the roof, and so on." Ultimately, the problem
stemmed from how the system architecture used pre-aggregated data buckets called tiles. For example,
they can aggregate the number of likes for a post in a given minute or other time range.
This allows them to compute metrics like the number of likes for multiple posts in the last two
hours. Here's a high-level look at the system architecture. There are a few real-time topics with
raw data (likes, clicks, etc.). A Flink job aggregates them into tiles and writes them to ScyllaDB.
Then there's a feature service that requests tiles from ScyllaDB, aggregates them and
returns results to the feed service.
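As a rough illustration of the tiling idea, here is a minimal in-memory sketch (not the team's actual Flink job; the event and tile shapes are assumptions) of how raw like events can be rolled up into one-minute tiles before being written to the database:

    package main

    import (
        "fmt"
        "time"
    )

    // LikeEvent is a hypothetical raw event from a real-time topic.
    type LikeEvent struct {
        PostID string
        At     time.Time
    }

    // tileKey identifies one pre-aggregated bucket: a post plus the start of a
    // fixed one-minute segment.
    type tileKey struct {
        PostID    string
        TileStart time.Time
    }

    // buildMinuteTiles rolls raw events up into per-minute like counts, which is
    // conceptually what the streaming job does before writing tiles to ScyllaDB.
    func buildMinuteTiles(events []LikeEvent) map[tileKey]int64 {
        tiles := make(map[tileKey]int64)
        for _, e := range events {
            k := tileKey{PostID: e.PostID, TileStart: e.At.Truncate(time.Minute)}
            tiles[k]++
        }
        return tiles
    }

    func main() {
        now := time.Now()
        events := []LikeEvent{
            {PostID: "post:123", At: now},
            {PostID: "post:123", At: now.Add(10 * time.Second)},
            {PostID: "post:123", At: now.Add(90 * time.Second)},
        }
        for k, likes := range buildMinuteTiles(events) {
            fmt.Printf("%s tile starting %s: %d likes\n", k.PostID, k.TileStart.Format(time.RFC3339), likes)
        }
    }

The feature service then only has to sum a handful of such tiles to answer a question like "likes for this post over the last two hours."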
The initial database schema and tiling configuration led to scalability problems.
Originally, each entity had its own partition, with timestamp and feature name as ordered
clustering columns.
Learn more in this NoSQL data modeling masterclass.
Tiles were computed for segments of one minute, 30 minutes, and one day.
Querying one hour, one day, seven days or 30 days required fetching around 70 tiles per feature on average.
If you do the math, it becomes clear why it failed.
The system needed to handle around 22 billion rows per second.
However, the database capacity was only 10 million rows/sec.
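To make the failure mode concrete, here is a sketch of what that original per-feature layout and read pattern might look like with the gocql driver. It is an assumption based on the description above, not ShareChat's actual schema; the keyspace, table, column names and host are placeholders:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    // Assumed original layout: one partition per entity and one row per
    // (timestamp, feature name) pair, so every feature of every tile is a
    // separate row that has to be fetched at read time.
    const createTable = `
    CREATE TABLE IF NOT EXISTS ml.feature_tiles (
        entity_id    text,
        ts           timestamp,
        feature_name text,
        value        double,
        PRIMARY KEY ((entity_id), ts, feature_name)
    )`

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // placeholder ScyllaDB host
        cluster.Keyspace = "ml"                  // assumes this keyspace already exists
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := session.Query(createTable).Exec(); err != nil {
            log.Fatal(err)
        }

        // Reading features for one entity over a two-hour window scans many
        // per-minute tile rows; with ~70 tiles per feature on average across the
        // supported windows, the row count explodes quickly.
        iter := session.Query(
            `SELECT ts, feature_name, value FROM ml.feature_tiles
             WHERE entity_id = ? AND ts >= ? AND ts < ?`,
            "post:123", time.Now().Add(-2*time.Hour), time.Now()).Iter()

        var (
            ts    time.Time
            name  string
            value float64
            likes float64
        )
        for iter.Scan(&ts, &name, &value) {
            if name == "likes_1m" { // pick out a single feature client-side
                likes += value
            }
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }
        fmt.Println("likes over the last 2 hours:", likes)
    }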
Early feature store optimizations: data modeling and tiling changes.
At that point, the team went on an optimization mission.
The initial database schema was updated to store all feature rows together, serialized as protocol buffers, for a given timestamp. Because the architecture was already
using Apache Flink, the transition to the new tiling schema was fairly easy, thanks to Flink's
advanced capabilities in building data pipelines. With this optimization, the features multiplier was
removed from the equation above, and the number of required rows to fetch was reduced by
100x, from around 22 billion to around 220 million rows/sec. The team also optimized the tiling configuration,
adding additional tiles for 5 minutes, 3 hours and 5 days to the 1-minute, 30-minute and 1-day tiles.
This reduced the average required tiles per feature from 70 to 23, further reducing the rows/sec to around 73 million.
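Here is a sketch of what that revised layout could look like, again as an assumption based on the description rather than the actual ShareChat schema: one row per entity and tile timestamp, with every feature for that tile packed into a single serialized protocol buffer blob:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    // Revised layout (sketch): all features for an entity at a given tile
    // timestamp live in one row as a serialized protobuf blob, so a read
    // fetches one row per tile instead of one row per feature.
    const createTable = `
    CREATE TABLE IF NOT EXISTS ml.feature_tiles_v2 (
        entity_id text,
        ts        timestamp,
        features  blob,
        PRIMARY KEY ((entity_id), ts)
    )`

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // placeholder ScyllaDB host
        cluster.Keyspace = "ml"                  // assumes this keyspace already exists
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := session.Query(createTable).Exec(); err != nil {
            log.Fatal(err)
        }

        // One row per tile now carries every feature for the entity.
        iter := session.Query(
            `SELECT ts, features FROM ml.feature_tiles_v2
             WHERE entity_id = ? AND ts >= ? AND ts < ?`,
            "post:123", time.Now().Add(-2*time.Hour), time.Now()).Iter()

        var (
            ts   time.Time
            blob []byte
            rows int
        )
        for iter.Scan(&ts, &blob) {
            // In practice the blob would be decoded with the team's generated
            // protobuf message (e.g. proto.Unmarshal) before aggregation.
            rows++
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }
        fmt.Println("tile rows fetched:", rows)
    }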
To handle more rows/sec on the database side, they changed the ScyllaDB compaction strategy from incremental to leveled.
Learn more about compaction strategies. That option better suited their query patterns,
keeping relevant rows together and reducing read I/O.
The result: ScyllaDB's capacity was effectively doubled.
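For reference, changing a table's compaction strategy is a schema-level setting. A hedged sketch against the hypothetical table above (the strategy class name is a real ScyllaDB option; the table and host are placeholders):

    package main

    import (
        "log"

        "github.com/gocql/gocql"
    )

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // placeholder ScyllaDB host
        cluster.Keyspace = "ml"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Switch the tile table to leveled compaction, which keeps rows for the
        // same partition in fewer SSTables and so reduces read I/O.
        const alter = `ALTER TABLE ml.feature_tiles_v2
                       WITH compaction = {'class': 'LeveledCompactionStrategy'}`
        if err := session.Query(alter).Exec(); err != nil {
            log.Fatal(err)
        }
    }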
The easiest way to accommodate the remaining load would have been to scale ScyllaDB 4x.
However, more and larger clusters would increase costs, and that simply wasn't in their budget.
So the team continued focusing on improving scalability without scaling up the ScyllaDB cluster.
Improving feature store cache locality to reduce database load.
One potential way to reduce the load on ScyllaDB was to improve the local cache hit rate,
so the team decided to research how this could be achieved.
The obvious choice was to use a consistent hashing approach,
a well-known technique to direct a request to a certain replica from the client based on some information about the request.
Since the team was using NGINX Ingress in their Kubernetes setup,
using NGINX's capabilities for consistent hashing seemed like a natural choice.
Per the NGINX Ingress documentation, setting up consistent hashing would be as simple as adding three lines of code.
What could go wrong? A bit, as it turned out. This simple configuration didn't work. Specifically, the client subset led to a huge key remapping, up to 100% in the worst case. Since node keys can be changed in a hash ring, it was impossible to use real-life scenarios with autoscaling (see the Ingress implementation). It was tricky to provide a hash value for a request because Ingress doesn't support the most obvious solution, a gRPC header. And the latency suffered severe degradation,
and it was unclear what was causing the tail latency.
To support a subset of the pods, the team modified their approach.
They created a two-step hash function: first hashing an entity, then adding a random prefix.
That distributed the entity across the desired number of pods.
In theory, this approach could cause a collision when an entity is mapped to the same pod several times.
However, the risk is low given the large number of replicas.
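A minimal sketch of such a two-step routing key (the function, parameter names and subset size are hypothetical, not the team's actual implementation): hash the entity, then prepend a small random prefix, and let the consistent-hashing load balancer map the resulting key to a pod:

    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
    )

    // routingKey builds the value used for consistent hashing.
    // Step 1: hash the entity so the same entity keeps landing on the same few
    // pods (good cache locality). Step 2: add a random prefix in [0, subsetSize)
    // so the entity is spread across a small subset of pods rather than exactly
    // one, which avoids hot-spotting a single replica.
    func routingKey(entityID string, subsetSize int) string {
        h := fnv.New64a()
        h.Write([]byte(entityID))
        prefix := rand.Intn(subsetSize)
        return fmt.Sprintf("%d:%016x", prefix, h.Sum64())
    }

    func main() {
        // The same post maps to at most 3 distinct routing keys, so roughly
        // 3 pods' caches will ever hold its features.
        for i := 0; i < 5; i++ {
            fmt.Println(routingKey("post:123", 3))
        }
    }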
Ingress doesn't support using a gRPC header as a variable,
but the team found a workaround: using path rewriting and providing the required hash key in the path itself.
The solution was admittedly a bit hacky, but it worked. Unfortunately, pinpointing the cause of latency
degradation would have required considerable time, as well as observability improvements.
A different approach was needed to scale the feature store in time. To meet the deadline, the team
split the feature service into 27 different services and manually split all entities between them
on the client. It wasn't the most elegant approach, but it was simple and practical, and it achieved
great results: the cache hit rate improved to 95% and the ScyllaDB load was reduced to 18.4 million
rows per second. With this design, ShareChat scaled its feature store to 1 billion features
per second by March. However, this old-school deployment-splitting approach still wasn't the ideal
design. Maintaining 27 deployments was tedious and inefficient. Plus, the cache hit rate wasn't
stable, and scaling was limited by having to keep a high minimum pod count in every deployment.
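That interim approach amounts to static client-side sharding. The article says the entities were split manually across the 27 deployments; purely for illustration, a hash-based version of the same idea (the service names and helper are hypothetical) could look like this:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const numDeployments = 27 // the article's interim split

    // deploymentFor statically maps an entity to one of the feature-service
    // deployments, so every request for that entity hits the same deployment
    // and therefore the same cache.
    func deploymentFor(entityID string) string {
        h := fnv.New64a()
        h.Write([]byte(entityID))
        return fmt.Sprintf("feature-service-%02d", h.Sum64()%numDeployments)
    }

    func main() {
        fmt.Println(deploymentFor("post:123"))
        fmt.Println(deploymentFor("user:42"))
    }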
So even though this approach technically met their needs, the team continued their search for
a better long-term solution. The next phase of optimizations: consistent hashing and the feature
service. Ready for yet another round of optimization, the team revisited the consistent
hashing approach using a sidecar, Envoy Proxy, deployed with the feature service.
Envoy proxy provided better observability which helped identify the latency tail issue.
The problem: different request patterns to the feature service caused a huge load on the
gRPC layer and cache, which led to extensive mutex contention. The team then optimized the feature
service. They forked the caching library fastcache from VictoriaMetrics, and implemented batch
writes and better eviction to reduce mutex contention by 100x. They forked gRPC-Go and implemented a buffer
pool across different connections to avoid contention during high parallelism. And they used object pooling
and tuned garbage collector (GC) parameters to reduce allocation rates and GC cycles. With
Envoy Proxy handling 15% of traffic in their proof of concept, the results were promising:
a 98% cache hit rate, which reduced the load on ScyllaDB to 7.4 million rows/sec.
They could even scale the feature store more, from 1 billion features per second to 3 billion features per second.
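The fastcache and gRPC-Go forks are specific to ShareChat's codebase, but the object-pooling and GC-tuning part of that work can be illustrated with standard Go primitives. A minimal sketch, with arbitrary placeholder values for the buffer size and GC percentage:

    package main

    import (
        "fmt"
        "runtime/debug"
        "sync"
    )

    // bufPool reuses byte slices across requests instead of allocating a fresh
    // one per request, which lowers allocation rates and therefore GC pressure.
    var bufPool = sync.Pool{
        New: func() any {
            b := make([]byte, 0, 64*1024) // placeholder buffer size
            return &b
        },
    }

    func handleRequest(payload []byte) int {
        bp := bufPool.Get().(*[]byte)
        buf := (*bp)[:0]
        defer func() {
            *bp = buf[:0]
            bufPool.Put(bp)
        }()

        // Pretend to serialize a response into the pooled buffer.
        buf = append(buf, payload...)
        return len(buf)
    }

    func main() {
        // Raising GOGC trades memory for fewer GC cycles; in practice this is
        // tuned (via debug.SetGCPercent or the GOGC environment variable) based
        // on observed allocation and GC behavior.
        debug.SetGCPercent(200) // placeholder value

        fmt.Println(handleRequest([]byte("feature request payload")))
    }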
Lessons learned from scaling a high-performance feature store.
Here's what this journey looked like from a timeline perspective.
To close, Andrei summed up the team's top lessons learned from this project.
First, use proven technologies.
Even as the ShareChat team drastically changed their system design, ScyllaDB, Apache Flink and VictoriaMetrics continued working well.
Each optimization is harder than the previous one and has less impact.
Simple and practical solutions, such as splitting the feature store into 27 deployments, do indeed work.
The solution that delivers the best performance isn't always user-friendly.
For instance, their revised database schema yields good performance, but is difficult to maintain
and understand.
Ultimately, they wrote some tooling around it to make it simpler to work with.
Every system is unique.
Sometimes you might need to fork a default library
and adjust it for your specific system to get the best performance.
Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
