The Good Tech Companies - How Tripadvisor Delivers Real-Time Personalization at Scale with ML
Episode Date: July 22, 2025. This story was originally published on HackerNoon at: https://hackernoon.com/how-tripadvisor-delivers-real-time-personalization-at-scale-with-ml. Tripadvisor uses ML and... ScyllaDB on AWS to deliver real-time personalization at massive scale with millisecond latency and advanced data architecture. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #tripadvisor-personalization, #real-time-ml, #scylladb-aws, #machine-learning-architecture, #ml-feature-store, #cassandra-migration, #kubernetes-microservices, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. Tripadvisor delivers real-time personalization to over 400M monthly users using ML models powered by ScyllaDB on AWS. Their Visitor Platform processes billions of daily requests with millisecond latency, leveraging a custom feature store, Kubernetes microservices, and data pipelines. Migrating from Cassandra to ScyllaDB boosted performance and reduced operational overhead.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
How TripAdvisor delivers real-time personalization at scale with ML
by ScyllaDB
See the engineering behind real-time personalization at TripAdvisor's massive,
and rapidly growing, scale. What kind of traveler are you?
TripAdvisor tries to assess this as soon as you engage with the site,
then offer you increasingly relevant information on every click, within a matter of milliseconds. This personalization is powered by advanced ML
models acting on data that's stored on ScyllaDB running on AWS. In this article, Dean Poulin,
TripAdvisor Data Engineering Lead on the AI Service and Products team, provides a look at how they
power this personalization. Dean shares a taste of the technical challenges involved in delivering real-time personalization
at TripAdvisor's massive, and rapidly growing, scale.
It's based on the following AWS re:Invent talk.
Pre-Trip Orientation
In Dean's words, let's start with a quick snapshot of who TripAdvisor is, and the scale
at which we operate. Founded in 2000, TripAdvisor has become a global leader in travel and hospitality,
helping hundreds of millions of travelers plan their perfect trips.
TripAdvisor generates over $1.8 billion in revenue and is a publicly traded company on
the Nasdaq stock exchange. Today, we have a talented team of over 2,800 employees driving innovation, and our platform
serves a staggering 400 million unique visitors per month, a number that's continuously growing.
On any given day, our system handles more than 2 billion requests from 25 to 50 million
users.
Every click you make on TripAdvisor is processed in real time.
Behind that, we're leveraging machine learning
models to deliver personalized recommendations, getting you closer to that perfect trip.
At the heart of this personalization engine is ScyllaDB running on AWS.
This allows us to deliver millisecond latency at a scale that few organizations reach.
At peak traffic, we hit around 425K operations per second on ScyllaDB, with P99 latencies for
reads and writes around 1-3 milliseconds. I'll be sharing how TripAdvisor is harnessing the
power of ScyllaDB, AWS, and real-time machine learning to deliver personalized recommendations
for every user. We'll explore how we help travelers discover everything they need to
plan their perfect trip, whether it's uncovering hidden gems, must-see attractions,
unforgettable experiences, or the best places to stay and dine.
This article is about the engineering behind that: how we deliver seamless, relevant content to users in real time,
helping them find exactly what they're looking for as quickly as possible.
Personalized Trip Planning
Imagine you're planning
a trip. As soon as you land on the TripAdvisor homepage, TripAdvisor already knows whether
you're a foodie, an adventurer, or a beach lover, and you're seeing spot-on recommendations that
seem personalized to your own interests. How does that happen within milliseconds? As you browse
around TripAdvisor, we start to personalize what you see using machine learning models which calculate scores based on your current and prior browsing activity.
We recommend hotels and experiences that we think you would be interested in.
We sort hotels based on your personal preferences.
We recommend popular points of interest near the hotel you're viewing.
These are all tuned based on your own personal preferences and prior browsing activity.
TripAdvisor's Model Serving Architecture
TripAdvisor runs on hundreds of independently scalable microservices in Kubernetes on-prem
and in Amazon EKS.
Our ML model serving platform is exposed through one of these microservices.
This gateway service abstracts over 100 ML models from the client services, which
lets us run A/B tests to find the best models using our experimentation platform. The ML models are
primarily developed by our data scientists and machine learning engineers using Jupyter
notebooks on Kubeflow. They're managed and trained using MLflow, and we deploy them on
Seldon Core in Kubernetes. Our custom feature store provides features to our ML models, enabling them to make accurate predictions.
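Before moving on to the feature store, here is a minimal sketch of the gateway pattern described above, assuming hypothetical model names and an internal endpoint URL (none of these identifiers come from TripAdvisor): the gateway pins each visitor to one model variant and forwards features to a Seldon-style prediction service.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ModelGateway {
    private final HttpClient http = HttpClient.newHttpClient();

    // Pin each visitor to one variant by hashing the visitor GUID, so a
    // visitor sees consistent results for the duration of the A/B test.
    String scoreVisitor(String visitorGuid, String featuresJson) throws Exception {
        int bucket = Math.floorMod(visitorGuid.hashCode(), 2);
        String variant = (bucket == 0)
                ? "hotel-ranker-v1"   // hypothetical model names
                : "hotel-ranker-v2";
        // Hypothetical Seldon-style prediction endpoint.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://models.internal/" + variant + "/api/v1.0/predictions"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(featuresJson))
                .build();
        return http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}

Because clients only ever call the gateway, experiments can swap the variant set behind it without any client changes.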
The custom feature store.
The feature store primarily serves user features and static features.
Static features are stored in Redis because they don't change very often.
We run data pipelines daily to load data from our offline data warehouse into our feature store as static features.
User features are served in real time through a platform called Visitor Platform.
We execute dynamic CQL queries against ScyllaDB, and we do not need a caching
layer because ScyllaDB is so fast. Our feature store serves up to 5 million
static features per second and half a million user features per second.
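As a rough illustration of what a dynamic CQL read for user features might look like, here is a sketch using the ScyllaDB/DataStax Java driver; the keyspace, table, and column names are assumptions modeled on the fact table described later in this article.

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class UserFeatureReader {
    private final CqlSession session;
    private final PreparedStatement recentFacts;

    public UserFeatureReader(CqlSession session) {
        this.session = session;
        // Assumed keyspace/table/column names (visitor_metric.fact).
        this.recentFacts = session.prepare(
            "SELECT attributes FROM visitor_metric.fact "
          + "WHERE visitor_guid = ? AND fact_type = ? AND created_date > ?");
    }

    // Example user feature: hotels viewed over the last 30 minutes.
    public ResultSet hotelsViewedRecently(UUID visitorGuid) {
        Instant cutoff = Instant.now().minus(30, ChronoUnit.MINUTES);
        return session.execute(recentFacts.bind(visitorGuid, "HOTEL_VIEW", cutoff));
    }
}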
What's an ML feature? Features are input variables to the ML models that are used to make a prediction.
There are static features and user features. Some examples of static features are awards that a
restaurant has won, or amenities offered by a hotel, like free Wi-Fi, pet-friendly, or a fitness center.
User features are collected in real time as users browse around the site. We store them in ScyllaDB so we can get lightning-fast queries.
Some examples of user features are the hotels viewed over the last 30 minutes,
restaurants viewed over the last 24 hours, or reviews submitted over the last 30 days.
The technologies powering Visitor Platform.
ScyllaDB is at the core of Visitor Platform.
We use Java-based Spring Boot microservices to expose the platform to our clients.
This is deployed on AWS ECS Fargate.
We run Apache Spark on Kubernetes for our daily data retention jobs and our offline-to-online jobs.
We use those jobs to load data from our offline data warehouse into ScyllaDB so that it's available on the live site.
We use Amazon Kinesis for processing streaming user tracking events.
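For illustration, a minimal consumer of such a tracking-event stream could look like the following sketch with the AWS SDK for Java v2; the stream and shard names are hypothetical, and a production consumer would typically use the Kinesis Client Library rather than polling a single shard.

import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.Record;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class TrackingEventPoller {
    public static void main(String[] args) {
        KinesisClient kinesis = KinesisClient.create();
        // Hypothetical stream and shard names.
        String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
                .streamName("user-tracking-events")
                .shardId("shardId-000000000000")
                .shardIteratorType(ShardIteratorType.LATEST)
                .build()).shardIterator();

        GetRecordsResponse batch = kinesis.getRecords(
                GetRecordsRequest.builder().shardIterator(iterator).build());
        for (Record r : batch.records()) {
            // Each record is a tracking event (page view, click, ...).
            System.out.println(r.data().asUtf8String());
        }
    }
}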
The Visitor Platform Dataflow. The following graphic shows how data flows through our platform
in four stages: produce, ingest, organize, and activate. Data is produced by our website
and our mobile apps. Some of that data includes our cross-device user identity graph,
behavior tracking events, like page views and clicks, and streaming events that go through Kinesis.
Also, audience segmentation gets loaded into our platform.
Visitor Platform's microservices are used to ingest and organize this data.
The data in ScyllaDB is stored in two keyspaces: the Visitor Core keyspace, which contains the visitor identity graph, and the Visitor Metric keyspace, which contains facts and metrics, the things that people did as they browsed the site.
We use daily ETL processes to maintain and clean up the data in the platform.
We produce data products, stamped daily, in our offline data warehouse, where they are available for other integrations and other data pipelines to use in their processing.
Here's a look at Visitor Platform by the numbers.
Why two databases?
Our online database is focused on the real-time, live website traffic.
ScyllaDB fills this role by providing very low latencies and high throughput.
We use short-term TTLs to prevent the data in the online database from
growing indefinitely, and our data retention jobs ensure that we only keep user activity
data for real visitors. TripAdvisor.com gets a lot of bot traffic, and we don't want to
store their data and try to personalize for bots, so we delete and clean up all that data. Our
offline data warehouse retains historical data used for reporting, creating other data products, and training our ML models.
We don't want large-scale offline data processes impacting the performance of our live site, so we have two separate databases used for two different purposes.
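The short-term TTLs mentioned above can be expressed directly in CQL. Here is a minimal sketch of what such a write might look like, assuming the fact table described below and a 30-day expiry (the article doesn't state the actual TTL values).

import java.util.UUID;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class FactWriter {
    // Assumed 30-day expiry; the article only says "short-term TTLs".
    private static final int TTL_SECONDS = 30 * 24 * 60 * 60;

    public static void writeFact(CqlSession session, UUID visitorGuid,
                                 String factType, String attributesJson) {
        // USING TTL lets ScyllaDB expire rows automatically, so the online
        // database never accumulates stale user-activity data.
        session.execute(SimpleStatement.newInstance(
            "INSERT INTO visitor_metric.fact "
          + "(visitor_guid, fact_type, created_date, attributes) "
          + "VALUES (?, ?, toTimestamp(now()), ?) USING TTL " + TTL_SECONDS,
            visitorGuid, factType, attributesJson));
    }
}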
Visitor Platform Microservices
We use five microservices for Visitor Platform.
Visitor Core manages the cross-device user identity graph based
on cookies and device IDs. Visitor Metric is our query engine; it exposes facts and metrics
for specific visitors. We use a domain-specific language called Visitor Query Language, or VQL.
For example, a VQL query can retrieve the latest commerce click facts over the last three hours.
Visitor Publisher and Visitor Saver handle the write path, writing data into the platform.
Besides saving data in ScyllaDB, we also stream data to the offline data warehouse.
That's done with Amazon Kinesis.
Visitor Composite simplifies publishing data in batch processing jobs.
It abstracts Visitor Saver and Visitor Core to identify visitors and publish facts and
metrics in a single API call.
Roundtrip microservice latency.
This graph illustrates how our microservice latencies remain stable over time.
The average latency is only 2.5 milliseconds, and our P999 is under 12.5 milliseconds.
This is impressive performance, especially given that we handle over 1 billion requests per day.
Our microservice clients have strict latency requirements.
95% of the calls must complete in 12 milliseconds or less.
If they go over that, then we will get paged and have to find out what's impacting the latencies.
ScyllaDB Latency
Here's a snapshot of ScyllaDB's performance over three days.
At peak, ScyllaDB is handling 340,000 operations per second, including writes, reads, and
deletes, and the CPU is hovering at just 21%.
This is high scale in action.
ScyllaDB delivers microsecond writes and millisecond reads for us.
This level of blazing-fast performance is exactly why we chose ScyllaDB.
Partitioning data in ScyllaDB. This image shows how we partition data in ScyllaDB. The Visitor Metric keyspace has two tables, fact and raw metrics. The primary key on the
fact table is visitor GUID, fact type, and created date. The composite partition key
is the visitor GUID and fact type. The clustering key is created date, which allows us to sort
data in partitions by date. The attributes column contains a JSON object representing the event
that occurred. Some example facts are search terms, page views, and bookings. We use ScyllaDB's leveled compaction strategy because it's optimized for range queries.
It handles high cardinality very well. It's better for read-heavy workloads,
and we have about 2-3x more reads than writes.
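Putting those pieces together, the fact table's DDL plausibly resembles the following sketch; column names and types are assumptions based on the description above.

import com.datastax.oss.driver.api.core.CqlSession;

public class FactSchema {
    public static void createFactTable(CqlSession session) {
        // Composite partition key (visitor_guid, fact_type) splits a
        // visitor's facts into separate partitions by type; the
        // created_date clustering key sorts each partition by time,
        // which is what the range queries rely on.
        session.execute(
            "CREATE TABLE IF NOT EXISTS visitor_metric.fact ("
          + "  visitor_guid uuid,"
          + "  fact_type    text,"
          + "  created_date timestamp,"
          + "  attributes   text,"   // JSON object describing the event
          + "  PRIMARY KEY ((visitor_guid, fact_type), created_date)"
          + ") WITH CLUSTERING ORDER BY (created_date DESC)"
          + "  AND compaction = {'class': 'LeveledCompactionStrategy'}");
    }
}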
Why ScyllaDB? Our solution was originally built using Cassandra on-prem,
but as the scale increased, so did the operational burden.
It required dedicated operations support in order for us to manage the database upgrades,
backups, etc. Also, our solution requires very low latencies for core components.
Our user identity management system must identify the user within 30 milliseconds,
and for the best personalization, we require our event tracking platform to respond in 40ms.
It's critical that our solution doesn't block rendering the page so our SLAs are very low.
With Cassandra, garbage collection impacted our performance, primarily in the tail latencies: the P999 and P9999 latencies.
We ran a proof of concept with ScyllaDB
and found the throughput to be much better than Cassandra's,
and the operational burden was eliminated.
ScyllaDB gave us a monstrously fast live-serving database
with the lowest possible latencies.
We wanted a fully managed option,
so we migrated from Cassandra to ScyllaDB Cloud
following a dual-write strategy.
That allowed us to migrate with zero downtime
while handling 40,000 requests per second.
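A dual-write setup can be as simple as the following sketch: one CQL session per cluster, with every write applied to both. This is a minimal illustration; a real migration also needs a historical backfill and read validation before cutover.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class DualWriter {
    private final CqlSession cassandra; // legacy cluster, still serving reads
    private final CqlSession scylla;    // migration target

    public DualWriter(CqlSession cassandra, CqlSession scylla) {
        this.cassandra = cassandra;
        this.scylla = scylla;
    }

    // Apply every write to both clusters so the target stays in sync;
    // once backfilled history is verified, reads cut over to ScyllaDB
    // and the Cassandra writes are dropped.
    public void write(SimpleStatement statement) {
        cassandra.execute(statement);
        scylla.execute(statement);
    }
}

Because ScyllaDB is CQL-compatible, the same driver and statements work against both clusters, which is what makes this strategy cheap to implement.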
Later, we migrated from ScyllaDB Cloud to ScyllaDB's Bring Your Own Account (BYOA) model, where
you can have the ScyllaDB team deploy the ScyllaDB database into your own AWS account. This gave
us improved performance as well as better data privacy.
This diagram shows what ScyllaDB's BYOA deployment looks like.
In the center of the diagram, you can see a six-node ScyllaDB cluster that is running on EC2.
Then there are two additional EC2 instances.
ScyllaDB Monitoring gives us Grafana dashboards as well as Prometheus metrics.
ScyllaDB Manager takes care of infrastructure automation, like triggering backups and repairs.
With this deployment, ScyllaDB can be co-located very close to our microservices to give us
even lower latencies as well as much higher throughput and performance.
Wrapping up, I hope you now have a better understanding of our architecture, the technologies
that power the platform, and how ScyllaDB plays a critical role in allowing us to handle TripAdvisor's extremely high scale.
About Cynthia Dunlop: Cynthia is Senior Director of Content Strategy at ScyllaDB.
She has been writing about software development and quality engineering for 20-plus years.
Thank you for listening to this Hacker Noon story, read by Artificial Intelligence.
Visit HackerNoon.com to read, write, learn and publish.
