The Good Tech Companies - How Tripadvisor Delivers Real-Time Personalization at Scale with ML
Episode Date: July 22, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/how-tripadvisor-delivers-real-time-personalization-at-scale-with-ml. Tripadvisor uses ML and... ScyllaDB on AWS to deliver real-time personalization at massive scale with millisecond latency and advanced data architecture. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #tripadvisor-personalization, #real-time-ml, #scylladb-aws, #machine-learning-architecture, #ml-feature-store, #cassandra-migration, #kubernetes-microservices, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. Tripadvisor delivers real-time personalization to over 400M monthly users using ML models powered by ScyllaDB on AWS. Their Visitor Platform processes billions of daily requests with millisecond latency, leveraging a custom feature store, Kubernetes microservices, and data pipelines. Migrating from Cassandra to ScyllaDB boosted performance and reduced operational overhead.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How TripAdvisor delivers real-time personalization at scale with ML
by ScyllaDB
See the engineering behind real-time personalization at TripAdvisor's massive,
and rapidly growing, scale. What kind of traveler are you?
TripAdvisor tries to assess this as soon as you engage with the site,
then offer you increasingly relevant information on every click, within a matter of milliseconds. This personalization is powered by advanced ML
models acting on data that's stored on ScyllaDB running on AWS. In this article, Dean Poulin,
TripAdvisor Data Engineering Lead on the AI Service and Products team, provides a look at how they
power this personalization. Dean shares a taste of the technical challenges involved in delivering real-time personalization
at TripAdvisor's massive, and rapidly growing, scale.
It's based on the following AWS re:Invent talk.
Pre-Trip Orientation
In Dean's words, let's start with a quick snapshot of who TripAdvisor is, and the scale
at which we operate. Founded in 2000, TripAdvisor has become a global leader in travel and hospitality,
helping hundreds of millions of travelers plan their perfect trips.
TripAdvisor generates over $1.8 billion in revenue and is a publicly traded company on
the Nasdaq stock exchange. Today, we have a talented team of over 2,800 employees driving innovation, and our platform
serves a staggering 400 million unique visitors per month, a number that's continuously growing.
On any given day, our system handles more than 2 billion requests from 25 to 50 million
users.
Every click you make on TripAdvisor is processed in real time.
Behind that, we're leveraging machine learning
models to deliver personalized recommendations, getting you closer to that perfect trip.
At the heart of this personalization engine is ScyllaDB running on AWS.
This allows us to deliver millisecond latency at a scale that few organizations reach.
At peak traffic, we hit around 425K operations per second on ScyllaDB, with P99 latencies for
reads and writes around 1-3 milliseconds. I'll be sharing how TripAdvisor is harnessing the
power of ScyllaDB, AWS, and real-time machine learning to deliver personalized recommendations
for every user. We'll explore how we help travelers discover everything they need to
plan their perfect trip, whether it's uncovering hidden gems, must-see attractions,
unforgettable experiences, or the best places to stay and dine.
This article is about the engineering behind that: how we deliver seamless, relevant content to users in real time,
helping them find exactly what they're looking for as quickly as possible.
Personalized Trip Planning
Imagine you're planning
a trip. As soon as you land on the TripAdvisor homepage, TripAdvisor already knows whether
you're a foodie, an adventurer, or a beach lover, and you're seeing spot-on recommendations that
seem personalized to your own interests. How does that happen within milliseconds? As you browse
around TripAdvisor, we start to personalize what you see using machine learning models which calculate scores based on your current and prior browsing activity.
We recommend hotels and experiences that we think you would be interested in.
We sort hotels based on your personal preferences.
We recommend popular points of interest near the hotel you're viewing.
These are all tuned based on your own personal preferences and prior browsing activity.
TripAdvisor's Model Serving Architecture
TripAdvisor runs on hundreds of independently scalable microservices in Kubernetes on-prem
and in Amazon EKS.
Our ML model serving platform is exposed through one of these microservices.
This gateway service abstracts over 100 ML models from the client services, which
lets us run A/B tests to find the best models using our experimentation platform. The ML models are
primarily developed by our data scientists and machine learning engineers using Jupyter
notebooks on Kubeflow. They're managed and trained using MLflow, and we deploy them on
Seldon Core in Kubernetes. Our custom feature store provides features to our ML models, enabling them to make accurate predictions.
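Before moving on to the feature store, here is a minimal sketch of the gateway pattern described above, assuming hypothetical model names and an internal endpoint URL (none of these identifiers come from TripAdvisor): the gateway pins each visitor to one model variant and forwards features to a Seldon-style prediction service.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ModelGateway {
    private final HttpClient http = HttpClient.newHttpClient();

    // Pin each visitor to one variant by hashing the visitor GUID, so a
    // visitor sees consistent results for the duration of the A/B test.
    String scoreVisitor(String visitorGuid, String featuresJson) throws Exception {
        int bucket = Math.floorMod(visitorGuid.hashCode(), 2);
        String variant = (bucket == 0)
                ? "hotel-ranker-v1"   // hypothetical model names
                : "hotel-ranker-v2";
        // Hypothetical Seldon-style prediction endpoint.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://models.internal/" + variant + "/api/v1.0/predictions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(featuresJson))
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}

Because clients only ever call the gateway, experiments can swap the variant set behind it without any client changes.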
The custom feature store.
The feature store primarily serves user features and static features.
Static features are stored in Redis because they don't change very often.
We run data pipelines daily to load data from our offline data warehouse into our feature store as static features.
User features are served in real time through a platform called Visitor Platform.
We execute dynamic CQL queries against ScyllaDB, and we do not need a caching
layer because ScyllaDB is so fast. Our feature store serves up to 5 million
static features per second and half a million user features per second.
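As a rough illustration of what a dynamic CQL read for user features might look like, here is a sketch using the ScyllaDB/DataStax Java driver; the keyspace, table, and column names are assumptions modeled on the fact table described later in this article.

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class UserFeatureReader {
    private final CqlSession session;
    private final PreparedStatement recentFacts;

    public UserFeatureReader(CqlSession session) {
        this.session = session;
        // Assumed keyspace/table/column names (visitor_metric.fact).
        this.recentFacts = session.prepare(
            "SELECT attributes FROM visitor_metric.fact "
          + "WHERE visitor_guid = ? AND fact_type = ? AND created_date > ?");
    }

    // Example user feature: hotels viewed over the last 30 minutes.
    public ResultSet hotelsViewedRecently(UUID visitorGuid) {
        Instant cutoff = Instant.now().minus(30, ChronoUnit.MINUTES);
        return session.execute(recentFacts.bind(visitorGuid, "HOTEL_VIEW", cutoff));
    }
}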
What's an ML feature? Features are input variables to the ML models that are used to make a prediction.
There are static features and user features. Some examples of static features are awards that a
restaurant has won, or amenities offered by a hotel, like free Wi-Fi, pet-friendly, or a fitness center.
User features are collected in real time as users browse around the site. We store them in ScyllaDB so we can get lightning-fast queries.
Some examples of user features are the hotels viewed over the last 30 minutes,
restaurants viewed over the last 24 hours, or reviews submitted over the last 30 days.
The technologies powering Visitor Platform.
ScyllaDB is at the core of Visitor Platform.
We use Java-based Spring Boot microservices to expose the platform to our clients.
This is deployed on AWS ECS Fargate.
We run Apache Spark on Kubernetes for our daily data retention jobs and our offline-to-online jobs.
We use those jobs to load data from our offline data warehouse into ScyllaDB so that it's available on the live site.
We use Amazon Kinesis for processing streaming user tracking events.
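For illustration, a minimal consumer of such a tracking-event stream could look like the following sketch with the AWS SDK for Java v2; the stream and shard names are hypothetical, and a production consumer would typically use the Kinesis Client Library rather than polling a single shard.

import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.Record;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class TrackingEventPoller {
    public static void main(String[] args) {
        KinesisClient kinesis = KinesisClient.create();
        // Hypothetical stream and shard names.
        String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
                .streamName("user-tracking-events")
                .shardId("shardId-000000000000")
                .shardIteratorType(ShardIteratorType.LATEST)
                .build()).shardIterator();

        GetRecordsResponse batch = kinesis.getRecords(
                GetRecordsRequest.builder().shardIterator(iterator).build());
        for (Record r : batch.records()) {
            // Each record is a tracking event (page view, click, ...).
            System.out.println(r.data().asUtf8String());
        }
    }
}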
The Visitor Platform Dataflow. The following graphic shows how data flows through our platform
in four stages: produce, ingest, organize, and activate. Data is produced by our website
and our mobile apps. Some of that data includes our cross-device user identity graph,
behavior tracking events, like page views and clicks, and streaming events that go through Kinesis.
Also, audience segmentation gets loaded into our platform.
Visitor Platform's microservices are used to ingest and organize this data.
The data in ScyllaDB is stored in two keyspaces: the Visitor Core keyspace, which contains the visitor identity graph, and the Visitor Metric keyspace, which contains facts and metrics, the things that people did as they browsed the site.
We use daily ETL processes to maintain and clean up the data in the platform.
We produce data products, stamped daily, in our offline data warehouse, where they are available for other integrations and other data pipelines to use in their processing.
Here's a look at Visitor Platform by the numbers.
Why two databases?
Our online database is focused on the real-time, live website traffic.
ScyllaDB fills this role by providing very low latencies and high throughput.
We use short-term TTLs to prevent the data in the online database from
growing indefinitely, and our data retention jobs ensure that we only keep user activity
data for real visitors. TripAdvisor.com gets a lot of bot traffic, and we don't want to
store their data and try to personalize for bots, so we delete and clean up all that data. Our
offline data warehouse retains historical data used for reporting, creating other data products, and training our ML models.
We don't want large-scale offline data processes impacting the performance of our live site, so we have two separate databases used for two different purposes.
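The short-term TTLs mentioned above can be expressed directly in CQL. Here is a minimal sketch of what such a write might look like, assuming the fact table described below and a 30-day expiry (the article doesn't state the actual TTL values).

import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class FactWriter {
    // Assumed 30-day expiry; the article only says "short-term TTLs".
    private static final int TTL_SECONDS = 30 * 24 * 60 * 60;

    public static void writeFact(CqlSession session, UUID visitorGuid,
                                 String factType, String attributesJson) {
        // USING TTL lets ScyllaDB expire rows automatically, so the online
        // database never accumulates stale user-activity data.
        session.execute(SimpleStatement.newInstance(
            "INSERT INTO visitor_metric.fact "
          + "(visitor_guid, fact_type, created_date, attributes) "
          + "VALUES (?, ?, toTimestamp(now()), ?) USING TTL " + TTL_SECONDS,
            visitorGuid, factType, attributesJson));
    }
}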
Visitor Platform Microservices
We use five microservices for Visitor Platform.
Visitor Core manages the cross-device user identity graph based
on cookies and device IDs. Visitor Metric is our query engine; it exposes facts and metrics
for specific visitors. We use a domain-specific language called Visitor Query Language, or VQL.
For example, a VQL query can retrieve the latest commerce click facts over the last three hours.
Visitor Publisher and Visitor Saver handle the write path, writing data into the platform.
Besides saving data in ScyllaDB, we also stream data to the offline data warehouse.
That's done with Amazon Kinesis.
Visitor Composite simplifies publishing data in batch processing jobs.
It abstracts Visitor Saver and Visitor Core to identify visitors and publish facts and
metrics in a single API call.
Roundtrip microservice latency.
This graph illustrates how our microservice latencies remain stable over time.
The average latency is only 2.5 milliseconds, and our P999 is under 12.5 milliseconds.
This is impressive performance, especially given that we handle over 1 billion requests per day.
Our microservice clients have strict latency requirements.
95% of the calls must complete in 12 milliseconds or less.
If they go over that, then we will get paged and have to find out what's impacting the latencies.
ScyllaDB Latency
Here's a snapshot of ScyllaDB's performance over three days.
At peak, ScyllaDB is handling 340,000 operations per second, including writes, reads, and
deletes, and the CPU is hovering at just 21%.
This is high scale in action.
ScyllaDB delivers microsecond writes and millisecond reads for us.
This level of blazing-fast performance is exactly why we chose ScyllaDB.
Partitioning data in ScyllaDB. This image shows how we partition data in ScyllaDB. The Visitor Metric keyspace has two tables, fact and raw metrics. The primary key on the
fact table is visitor GUID, fact type, and created date. The composite partition key
is the visitor GUID and fact type. The clustering key is created date, which allows us to sort
data in partitions by date. The attributes column contains a JSON object representing the event
that occurred. Some example facts are search terms, page views, and bookings. We use ScyllaDB's leveled compaction strategy because it's optimized for range queries.
It handles high cardinality very well. It's better for read-heavy workloads,
and we have about 2-3x more reads than writes.
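Putting those pieces together, the fact table's DDL plausibly resembles the following sketch; column names and types are assumptions based on the description above.

import com.datastax.oss.driver.api.core.CqlSession;

public class FactSchema {
    public static void createFactTable(CqlSession session) {
        // Composite partition key (visitor_guid, fact_type) splits a
        // visitor's facts into separate partitions by type; the
        // created_date clustering key sorts each partition by time,
        // which is what the range queries rely on.
        session.execute(
            "CREATE TABLE IF NOT EXISTS visitor_metric.fact ("
          + "  visitor_guid uuid,"
          + "  fact_type    text,"
          + "  created_date timestamp,"
          + "  attributes   text,"   // JSON object describing the event
          + "  PRIMARY KEY ((visitor_guid, fact_type), created_date)"
          + ") WITH CLUSTERING ORDER BY (created_date DESC)"
          + "  AND compaction = {'class': 'LeveledCompactionStrategy'}");
    }
}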
Why ScyllaDB? Our solution was originally built using Cassandra on-prem,
but as the scale increased, so did the operational burden.
It required dedicated operations support in order for us to manage the database upgrades,
backups, etc. Also, our solution requires very low latencies for core components.
Our user identity management system must identify the user within 30 milliseconds,
and for the best personalization, we require our event tracking platform to respond in 40ms.
It's critical that our solution doesn't block rendering the page so our SLAs are very low.
With Cassandra, garbage collection impacted our performance, primarily in the tail latencies: the P999 and P9999 latencies.
We ran a proof of concept with ScyllaDB
and found the throughput to be much better than Cassandra's,
and the operational burden was eliminated.
ScyllaDB gave us a monstrously fast live-serving database
with the lowest possible latencies.
We wanted a fully managed option,
so we migrated from Cassandra to ScyllaDB Cloud
following a dual-write strategy.
That allowed us to migrate with zero downtime
while handling 40,000 requests per second.
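A dual-write setup can be as simple as the following sketch: one CQL session per cluster, with every write applied to both. This is a minimal illustration; a real migration also needs a historical backfill and read validation before cutover.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class DualWriter {
    private final CqlSession cassandra; // legacy cluster, still serving reads
    private final CqlSession scylla;    // migration target

    public DualWriter(CqlSession cassandra, CqlSession scylla) {
        this.cassandra = cassandra;
        this.scylla = scylla;
    }

    // Apply every write to both clusters so the target stays in sync;
    // once backfilled history is verified, reads cut over to ScyllaDB
    // and the Cassandra writes are dropped.
    public void write(SimpleStatement statement) {
        cassandra.execute(statement);
        scylla.execute(statement);
    }
}

Because ScyllaDB is CQL-compatible, the same driver and statements work against both clusters, which is what makes this strategy cheap to implement.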
Later, we migrated from ScyllaDB Cloud to ScyllaDB's Bring Your Own Account (BYOA) model, where
you can have the ScyllaDB team deploy the ScyllaDB database into your own AWS account. This gave
us improved performance as well as better data privacy.
This diagram shows what ScyllaDB's BYOA deployment looks like.
In the center of the diagram, you can see a six-node ScyllaDB cluster that is running on EC2.
Then there are two additional EC2 instances.
ScyllaDB Monitoring gives us Grafana dashboards as well as Prometheus metrics.
ScyllaDB Manager takes care of infrastructure automation, like triggering backups and repairs.
With this deployment, ScyllaDB can be co-located very close to our microservices to give us
even lower latencies as well as much higher throughput and performance.
Wrapping up, I hope you now have a better understanding of our architecture, the technologies
that power the platform, and how ScyllaDB plays a critical role in allowing us to handle TripAdvisor's extremely high scale.
About Cynthia Dunlop: Cynthia is Senior Director of Content Strategy at ScyllaDB.
She has been writing about software development and quality engineering for 20-plus years.
Thank you for listening to this Hacker Noon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.
