The Good Tech Companies - How ShareChat Scaled their ML Feature Store 1000X without Scaling the Database
Episode Date: February 3, 2026. This story was originally published on HackerNoon at: https://hackernoon.com/how-sharechat-scaled-their-ml-feature-store-1000x-without-scaling-the-database. How ShareChat scaled its ML feature store 1000× using ScyllaDB, smarter data modeling, and caching, without scaling the database. This story was written by: @scylladb. ShareChat engineers rebuilt a failing ML feature store into a system capable of serving billions of features per second without scaling the database. By redesigning data models, optimizing tiling, improving cache locality, and tuning gRPC and GC behavior, they turned ScyllaDB into a low-latency backbone for real-time recommendations.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How ShareChat scaled their ML feature store 1000X without scaling the database, by ScyllaDB.
How ShareChat engineers rebuilt a low-latency ML feature store on ScyllaDB after an initial
scalability failure, and what they learned along the way. The demand for low-latency machine
learning feature stores is higher than ever, but actually implementing one at scale remains a challenge.
That became clear when ShareChat engineers Ivan Burmistrov and Andrei Manakov took the P99
CONF 23 stage to share how they built a low-latency ML feature store based on ScyllaDB.
This isn't a tidy case study where adopting a new product saves the day.
It's a lessons learned story, a look at the value of relentless performance optimization,
with some important engineering takeaways.
The original system implementation fell far short of the company's scalability requirements.
The ultimate goal was to support one billion features per second, but the system failed under a load of just one million.
With some smart problem-solving, the team pulled it off, though.
Let's look at how their engineers managed to pivot from the initial failure to meet their lofty performance goal without scaling the underlying database.
ShareChat: India's leading social media platform.
To understand the scope of the challenge, it's important to know a little about ShareChat,
the leading social media platform in India.
On the ShareChat app, users discover and consume content in more than 15 different languages,
including videos, images, songs and more.
ShareChat also hosts a TikTok-like short video platform, Moj, that encourages users to be creative
with trending tags and contests.
Between the two applications, they serve a rapidly growing user base that already has over
325 million monthly active users, and their AI-based content recommendation engine is essential for driving user retention and engagement.
Feature stores at ShareChat. This story focuses on the system behind ML feature stores for the
short-form video app Moj. It offers fully personalized feeds to around 20 million daily active
users and 100 million monthly active users. Feeds serve 8,000 requests per second, and there's an
average of 2,000 content candidates being ranked on each request (for example, to find the 10 best
items to recommend). Features are pretty much anything that can be extracted from the data.
Ivan Burmistrov, principal staff software engineer at ShareChat, explained: "We compute features
for different entities. Post is one entity, user is another, and so on. From the computation
perspective, they're quite similar. However, the important difference is in the number of features
we need to fetch for each type of entity. When a user requests a feed, we fetch user features for
that single user. However, to rank all the posts, we need to fetch features for each candidate
post being ranked, so the total load on the system generated by post features is much larger than
the one generated by user features. This difference plays an important role in our story."
Why the initial feature store architecture failed to scale. At first, the primary focus was on
building a real-time user feature store because, at that point, user features were the most
important. The team started to build the feature store with that goal in mind. But then priorities
changed and post features became the focus too. This shift happened because the team started building
an entirely new ranking system with two major differences versus its predecessor. First, near-real-time
post features were more important. Second, the number of posts to rank increased from hundreds to thousands.
Ivan explained: "When we went to test this new system, it failed miserably. At around 1 million features per second,
the system became unresponsive, latencies went through the roof, and so on." Ultimately, the problem
stemmed from how the system architecture used pre-aggregated data buckets called tiles. For example,
they can aggregate the number of likes for a post in a given minute or other time range.
This allows them to compute metrics like the number of likes for multiple posts in the last two
hours. Here's a high-level look at the system architecture. There are a few real-time topics with
raw data (likes, clicks, etc.). A Flink job aggregates them into tiles and writes them to ScyllaDB.
Then there's a feature service that requests tiles from ScyllaDB, aggregates them and
returns results to the feed service.
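As a rough illustration of the tiling idea, here is a minimal in-memory sketch (not the team's actual Flink job; the event and tile shapes are assumptions) of how raw like events can be rolled up into one-minute tiles before being written to the database:

    package main

    import (
        "fmt"
        "time"
    )

    // LikeEvent is a hypothetical raw event from a real-time topic.
    type LikeEvent struct {
        PostID string
        At     time.Time
    }

    // tileKey identifies one pre-aggregated bucket: a post plus the start of a
    // fixed one-minute segment.
    type tileKey struct {
        PostID    string
        TileStart time.Time
    }

    // buildMinuteTiles rolls raw events up into per-minute like counts, which is
    // conceptually what the streaming job does before writing tiles to ScyllaDB.
    func buildMinuteTiles(events []LikeEvent) map[tileKey]int64 {
        tiles := make(map[tileKey]int64)
        for _, e := range events {
            k := tileKey{PostID: e.PostID, TileStart: e.At.Truncate(time.Minute)}
            tiles[k]++
        }
        return tiles
    }

    func main() {
        now := time.Now()
        events := []LikeEvent{
            {PostID: "post:123", At: now},
            {PostID: "post:123", At: now.Add(10 * time.Second)},
            {PostID: "post:123", At: now.Add(90 * time.Second)},
        }
        for k, likes := range buildMinuteTiles(events) {
            fmt.Printf("%s tile starting %s: %d likes\n", k.PostID, k.TileStart.Format(time.RFC3339), likes)
        }
    }

The feature service then only has to sum a handful of such tiles to answer a question like "likes for this post over the last two hours."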
The initial database schema and tiling configuration led to scalability problems.
Originally, each entity had its own partition, with timestamp and feature name as ordered
clustering columns.
Learn more in this NoSQL data modeling masterclass.
Tiles were computed for segments of one minute, 30 minutes, and one day.
Querying one hour, one day, seven days or 30 days required fetching around 70 tiles per feature on average.
If you do the math, it becomes clear why it failed.
The system needed to handle around 22 billion rows per second.
However, the database capacity was only 10 million rows/sec.
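To make the failure mode concrete, here is a sketch of what that original per-feature layout and read pattern might look like with the gocql driver. It is an assumption based on the description above, not ShareChat's actual schema; the keyspace, table, column names and host are placeholders:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    // Assumed original layout: one partition per entity and one row per
    // (timestamp, feature name) pair, so every feature of every tile is a
    // separate row that has to be fetched at read time.
    const createTable = `
    CREATE TABLE IF NOT EXISTS ml.feature_tiles (
        entity_id    text,
        ts           timestamp,
        feature_name text,
        value        double,
        PRIMARY KEY ((entity_id), ts, feature_name)
    )`

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // placeholder ScyllaDB host
        cluster.Keyspace = "ml"                  // assumes this keyspace already exists
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := session.Query(createTable).Exec(); err != nil {
            log.Fatal(err)
        }

        // Reading features for one entity over a two-hour window scans many
        // per-minute tile rows; with ~70 tiles per feature on average across the
        // supported windows, the row count explodes quickly.
        iter := session.Query(
            `SELECT ts, feature_name, value FROM ml.feature_tiles
             WHERE entity_id = ? AND ts >= ? AND ts < ?`,
            "post:123", time.Now().Add(-2*time.Hour), time.Now()).Iter()

        var (
            ts    time.Time
            name  string
            value float64
            likes float64
        )
        for iter.Scan(&ts, &name, &value) {
            if name == "likes_1m" { // pick out a single feature client-side
                likes += value
            }
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }
        fmt.Println("likes over the last 2 hours:", likes)
    }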
Early feature store optimizations: data modeling and tiling changes.
At that point, the team went on an optimization mission.
The initial database schema was updated to store all feature rows together, serialized as protocol buffers, for a given timestamp. Because the architecture was already
using Apache Flink, the transition to the new tiling schema was fairly easy, thanks to Flink's
advanced capabilities in building data pipelines. With this optimization, the features multiplier was
removed from the equation above, and the number of required rows to fetch was reduced by
100x, from around 22 billion to around 220 million rows/sec. The team also optimized the tiling configuration,
adding additional tiles for 5 minutes, 3 hours and 5 days to the 1-minute, 30-minute and 1-day tiles.
This reduced the average required tiles per feature from 70 to 23, further reducing the rows/sec to around 73 million.
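Here is a sketch of what that revised layout could look like, again as an assumption based on the description rather than the actual ShareChat schema: one row per entity and tile timestamp, with every feature for that tile packed into a single serialized protocol buffer blob:

    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    // Revised layout (sketch): all features for an entity at a given tile
    // timestamp live in one row as a serialized protobuf blob, so a read
    // fetches one row per tile instead of one row per feature.
    const createTable = `
    CREATE TABLE IF NOT EXISTS ml.feature_tiles_v2 (
        entity_id text,
        ts        timestamp,
        features  blob,
        PRIMARY KEY ((entity_id), ts)
    )`

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // placeholder ScyllaDB host
        cluster.Keyspace = "ml"                  // assumes this keyspace already exists
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := session.Query(createTable).Exec(); err != nil {
            log.Fatal(err)
        }

        // One row per tile now carries every feature for the entity.
        iter := session.Query(
            `SELECT ts, features FROM ml.feature_tiles_v2
             WHERE entity_id = ? AND ts >= ? AND ts < ?`,
            "post:123", time.Now().Add(-2*time.Hour), time.Now()).Iter()

        var (
            ts   time.Time
            blob []byte
            rows int
        )
        for iter.Scan(&ts, &blob) {
            // In practice the blob would be decoded with the team's generated
            // protobuf message (e.g. proto.Unmarshal) before aggregation.
            rows++
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }
        fmt.Println("tile rows fetched:", rows)
    }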
To handle more rows/sec on the database side, they changed the ScyllaDB compaction strategy from incremental to leveled.
Learn more about compaction strategies. That option better suited their query patterns,
keeping relevant rows together and reducing read I/O.
The result: ScyllaDB's capacity was effectively doubled.
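For reference, changing a table's compaction strategy is a schema-level setting. A hedged sketch against the hypothetical table above (the strategy class name is a real ScyllaDB option; the table and host are placeholders):

    package main

    import (
        "log"

        "github.com/gocql/gocql"
    )

    func main() {
        cluster := gocql.NewCluster("127.0.0.1") // placeholder ScyllaDB host
        cluster.Keyspace = "ml"
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Switch the tile table to leveled compaction, which keeps rows for the
        // same partition in fewer SSTables and so reduces read I/O.
        const alter = `ALTER TABLE ml.feature_tiles_v2
                       WITH compaction = {'class': 'LeveledCompactionStrategy'}`
        if err := session.Query(alter).Exec(); err != nil {
            log.Fatal(err)
        }
    }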
The easiest way to accommodate the remaining load would have been to scale ScyllaDB 4x.
However, more and larger clusters would increase costs, and that simply wasn't in their budget.
So the team continued focusing on improving scalability without scaling up the ScyllaDB cluster.
Improving feature store cache locality to reduce database load.
One potential way to reduce the load on ScyllaDB was to improve the local cache hit rate,
so the team decided to research how this could be achieved.
The obvious choice was to use a consistent hashing approach,
a well-known technique to direct a request to a certain replica from the client based on some information about the request.
Since the team was using NGINX Ingress in their Kubernetes setup,
using NGINX's capabilities for consistent hashing seemed like a natural choice.
Per the NGINX Ingress documentation, setting up consistent hashing would be as simple as adding three lines of code.
What could go wrong? A bit, as it turned out. This simple configuration didn't work. Specifically, the client subset led to a huge key remapping, up to 100% in the worst case. Since node keys can be changed in a hash ring, it was impossible to use real-life scenarios with autoscaling (see the Ingress implementation). It was tricky to provide a hash value for a request because Ingress doesn't support the most obvious solution, a gRPC header. And the latency suffered severe degradation,
and it was unclear what was causing the tail latency.
To support a subset of the pods, the team modified their approach.
They created a two-step hash function: first hashing an entity, then adding a random prefix.
That distributed the entity across the desired number of pods.
In theory, this approach could cause a collision when an entity is mapped to the same pod several times.
However, the risk is low given the large number of replicas.
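A minimal sketch of such a two-step routing key (the function, parameter names and subset size are hypothetical, not the team's actual implementation): hash the entity, then prepend a small random prefix, and let the consistent-hashing load balancer map the resulting key to a pod:

    package main

    import (
        "fmt"
        "hash/fnv"
        "math/rand"
    )

    // routingKey builds the value used for consistent hashing.
    // Step 1: hash the entity so the same entity keeps landing on the same few
    // pods (good cache locality). Step 2: add a random prefix in [0, subsetSize)
    // so the entity is spread across a small subset of pods rather than exactly
    // one, which avoids hot-spotting a single replica.
    func routingKey(entityID string, subsetSize int) string {
        h := fnv.New64a()
        h.Write([]byte(entityID))
        prefix := rand.Intn(subsetSize)
        return fmt.Sprintf("%d:%016x", prefix, h.Sum64())
    }

    func main() {
        // The same post maps to at most 3 distinct routing keys, so roughly
        // 3 pods' caches will ever hold its features.
        for i := 0; i < 5; i++ {
            fmt.Println(routingKey("post:123", 3))
        }
    }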
Ingress doesn't support using a gRPC header as a variable,
but the team found a workaround: using path rewriting and providing the required hash key in the path itself.
The solution was admittedly a bit hacky, but it worked. Unfortunately, pinpointing the cause of latency
degradation would have required considerable time, as well as observability improvements.
A different approach was needed to scale the feature store in time. To meet the deadline, the team
split the feature service into 27 different services and manually split all entities between them
on the client. It wasn't the most elegant approach, but it was simple and practical, and it achieved
great results: the cache hit rate improved to 95% and the ScyllaDB load was reduced to 18.4 million
rows per second. With this design, ShareChat scaled its feature store to 1 billion features
per second by March. However, this old-school deployment-splitting approach still wasn't the ideal
design. Maintaining 27 deployments was tedious and inefficient. Plus, the cache hit rate wasn't
stable, and scaling was limited by having to keep a high minimum pod count in every deployment.
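That interim approach amounts to static client-side sharding. The article says the entities were split manually across the 27 deployments; purely for illustration, a hash-based version of the same idea (the service names and helper are hypothetical) could look like this:

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    const numDeployments = 27 // the article's interim split

    // deploymentFor statically maps an entity to one of the feature-service
    // deployments, so every request for that entity hits the same deployment
    // and therefore the same cache.
    func deploymentFor(entityID string) string {
        h := fnv.New64a()
        h.Write([]byte(entityID))
        return fmt.Sprintf("feature-service-%02d", h.Sum64()%numDeployments)
    }

    func main() {
        fmt.Println(deploymentFor("post:123"))
        fmt.Println(deploymentFor("user:42"))
    }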
So even though this approach technically met their needs, the team continued their search for
a better long-term solution. The next phase of optimizations: consistent hashing and the feature
service. Ready for yet another round of optimization, the team revisited the consistent
hashing approach using a sidecar, Envoy Proxy, deployed with the feature service.
Envoy proxy provided better observability which helped identify the latency tail issue.
The problem: different request patterns to the feature service caused a huge load on the
gRPC layer and cache, which led to extensive mutex contention. The team then optimized the feature
service. They forked the caching library fastcache from VictoriaMetrics, and implemented batch
writes and better eviction to reduce mutex contention by 100x. They forked gRPC-Go and implemented a buffer
pool across different connections to avoid contention during high parallelism. And they used object pooling
and tuned garbage collector (GC) parameters to reduce allocation rates and GC cycles. With
Envoy Proxy handling 15% of traffic in their proof of concept, the results were promising:
a 98% cache hit rate, which reduced the load on ScyllaDB to 7.4 million rows/sec.
They could even scale the feature store more, from 1 billion features per second to 3 billion features per second.
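The fastcache and gRPC-Go forks are specific to ShareChat's codebase, but the object-pooling and GC-tuning part of that work can be illustrated with standard Go primitives. A minimal sketch, with arbitrary placeholder values for the buffer size and GC percentage:

    package main

    import (
        "fmt"
        "runtime/debug"
        "sync"
    )

    // bufPool reuses byte slices across requests instead of allocating a fresh
    // one per request, which lowers allocation rates and therefore GC pressure.
    var bufPool = sync.Pool{
        New: func() any {
            b := make([]byte, 0, 64*1024) // placeholder buffer size
            return &b
        },
    }

    func handleRequest(payload []byte) int {
        bp := bufPool.Get().(*[]byte)
        buf := (*bp)[:0]
        defer func() {
            *bp = buf[:0]
            bufPool.Put(bp)
        }()

        // Pretend to serialize a response into the pooled buffer.
        buf = append(buf, payload...)
        return len(buf)
    }

    func main() {
        // Raising GOGC trades memory for fewer GC cycles; in practice this is
        // tuned (via debug.SetGCPercent or the GOGC environment variable) based
        // on observed allocation and GC behavior.
        debug.SetGCPercent(200) // placeholder value

        fmt.Println(handleRequest([]byte("feature request payload")))
    }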
Lessons learned from scaling a high-performance feature store.
Here's what this journey looked like from a timeline perspective.
To close, Andrei summed up the team's top lessons learned from this project.
First, use proven technologies.
Even as the ShareChat team drastically changed their system design, ScyllaDB, Apache Flink and VictoriaMetrics continued working well.
Each optimization is harder than the previous one and has less impact.
Simple and practical solutions, such as splitting the feature store into 27 deployments, do indeed work.
The solution that delivers the best performance isn't always user-friendly.
For instance, their revised database schema yields good performance, but is difficult to maintain
and understand.
Ultimately, they wrote some tooling around it to make it simpler to work with.
Every system is unique.
Sometimes you might need to fork a default library
and adjust it for your specific system to get the best performance.
Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
