The Good Tech Companies - Inside Tencent Games’ Real-Time Event-Driven Analytics System

Episode Date: February 26, 2026

This story was originally published on HackerNoon at: https://hackernoon.com/inside-tencent-games-real-time-event-driven-analytics-system. Tencent Games built a real-time CQRS analytics system with Pulsar and ScyllaDB to power global gameplay monitoring and risk control. Check more stories related to data-science at: https://hackernoon.com/c/data-science. You can also check exclusive content about #tencent-games-analytics, #cqrs-gaming-event-sourcing, #scylladb-time-series-event, #apache-pulsar-gaming-event, #scylladb-keyspace-replication, #timewindow-compaction-strategy, #real-time-data-pipeline, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. Tencent Games engineered a real-time, event-driven analytics system using CQRS and event sourcing with Apache Pulsar and ScyllaDB. The platform processes massive gameplay events, dispatches time-series data efficiently, and replicates keyspaces globally for compliance and multi-region control. The result is scalable fan-in/fan-out event streaming that powers anti-cheating and risk detection across millions of players worldwide.

Transcript
Starting point is 00:00:00 This audio is presented by Hacker Noon, where anyone can learn anything about any technology. Inside Tencent Games' Real-Time Event-Driven Analytics System, by ScyllaDB. A look at how Tencent Games built a service architecture based on CQRS and event sourcing patterns with Pulsar and ScyllaDB. As a part of Tencent Interactive Entertainment Group Global, IEG Global, Proxima Beta is committed to supporting our teams and studios to bring unique, exhilarating games to millions of players around the world. You might be familiar with some of our current games, such as PUBG Mobile, Arena of Valor, and Tower of Fantasy. Our team at Level Infinite, the brand for global publishing, is responsible
Starting point is 00:00:40 for managing a wide range of risks to our business, for example, cheating activities and harmful content. From a technical perspective, this required us to build an efficient real-time analytics system to consistently monitor all kinds of activities in our business domain. In this blog, we share our experience of building this real-time event-driven analytics system. First, we'll explore why we built our service architecture based on Command and Query Responsibility Segregation, CQRS, and event sourcing patterns with Apache Pulsar and ScyllaDB. Next, we'll look at how we use ScyllaDB to solve the problem of dispatching events to numerous gameplay sessions. Finally, we'll cover how we use ScyllaDB keyspaces and data replication to simplify our global data management. A peek at the use case:
Starting point is 00:01:27 addressing risks in Tencent games. Let's start with a real-world example of what we're working with and the challenges we face. This is a screenshot from Tower of Fantasy, a 3D action role-playing game. Players can use this dialogue to file a report against another player for various reasons. If you were to use a typical crud system for it, how would you keep those records for follow-ups? And what are the potential problems? The first challenge would be determining which team is going to own the database to store this form. There are different reasons to make a report, including an option called others, so a case might be handled by different functional teams. However, there is not a single functional team in our organization Thadjohn fully own the form. That's why it is a natural choice for us to capture
Starting point is 00:02:12 this case as an event, like "report a case." All the information is captured in this event as is. All functional teams only need to subscribe to this event and do their own filtering. If they think the case falls into their domain, they can just capture it and trigger further actions. CQRS and event sourcing. The service architecture behind this example is based on the CQRS and event sourcing patterns. If these terms are new to you, don't worry. By the end of this overview, you should have a solid understanding of these concepts. And if you want more detail at that point, take a look at our blog dedicated to this topic. The first concept to understand here is event sourcing. The core idea behind event sourcing is that every change to a system's state is captured
Starting point is 00:02:56 in an event object, and these event objects are stored in the order in which they were applied to the system state. In other words, instead of just storing the current state, we use an append-only store to record the entire series of actions taken on that state. This concept is simple but powerful, as the events that represent every action are recorded, so that any possible model describing the system can be built from the events. The next concept is CQRS, which stands for Command Query Responsibility Segregation. CQRS was coined by Greg Young over a decade ago and originated from the command and query separation principle. The fundamental idea is to create separate data models for reads and writes rather than using the same model for both purposes.
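The event-sourcing idea described above can be illustrated with a minimal in-memory sketch. This is not Tencent's actual code; the class and event names are made up for illustration. The store only ever appends, and any model of the system is rebuilt by folding over the recorded events:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Event:
    kind: str                 # e.g. "ReportACase" (illustrative name)
    payload: dict[str, Any]   # all information captured as is

class EventStore:
    """Append-only log: state changes are never updated or deleted,
    only appended in the order they were applied."""
    def __init__(self) -> None:
        self._log: list[Event] = []

    def append(self, event: Event) -> None:
        self._log.append(event)

    def replay(self) -> list[Event]:
        # Returns events in the order they were applied to the system.
        return list(self._log)

# Any model describing the system can be rebuilt by folding over the
# events; here, a hypothetical "open case" counter.
def open_case_count(events: list[Event]) -> int:
    state = 0
    for e in events:
        if e.kind == "ReportACase":
            state += 1
        elif e.kind == "CaseResolved":
            state -= 1
    return state

store = EventStore()
store.append(Event("ReportACase", {"reporter": "player-1", "reason": "cheating"}))
store.append(Event("ReportACase", {"reporter": "player-2", "reason": "others"}))
store.append(Event("CaseResolved", {"case": 1}))
print(open_case_count(store.replay()))  # prints 1
```

In the architecture the talk describes, the append-only log would live in a message bus such as Pulsar rather than in memory, and each functional team's subscriber would run its own fold over the same events to build its own view.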
Starting point is 00:03:39 By following the CQRS pattern, every API should either be a command that performs an action or a query that returns data to the caller, but not both. This naturally divides the system into two parts: the write side and the read side. This separation offers several benefits. For example, we can scale write and read capacity independently for optimizing cost efficiency. From a teamwork perspective, different teams can create different views of the same data with fewer conflicts. The high-level workflow of the write side can be summarized as follows. Events that occur in numerous gameplay sessions are fed into a limited number of event processors. The implementation is also straightforward, typically involving a message bus such as Pulsar, Kafka, or a simpler queue
Starting point is 00:04:24 system that acts as an event store. Events from clients are persisted in the event store by topic, and event processors consume events by subscribing to topics. If you're interested in why we chose Apache Pulsar over other systems, you can find more information in the blog referenced earlier. Although queue-like systems are usually efficient at handling traffic that flows in one direction (e.g., fan-in), they may not be as effective at handling traffic that flows in the opposite direction (e.g., fan-out). In our scenario, the number of gameplay sessions will be large, and a typical queue system doesn't fit well since we can't afford to create a dedicated queue for every gameplay session. We need to find a practical way to distribute findings and metrics to individual
Starting point is 00:05:08 gameplay sessions through query APIs. This is why we use ScyllaDB to build another queue-like event store, which is optimized for event fan-out. We will discuss this further in the next section. Before we move on, here's a summary of our service architecture. Starting from the write side, game servers keep sending events to our system through command endpoints, and each event represents a certain kind of activity that occurred in a gameplay session. Event processors produce findings or metrics against the event streams of each gameplay session and act as a bridge between the two sides. On the read side, we have game servers or other clients that keep pulling metrics and findings
Starting point is 00:05:45 through query endpoints and take further actions if abnormal activities have been observed. Distributed queue-like event store for time series events. Now let's look at how we use ScyllaDB to solve the problem of dispatching events to numerous gameplay sessions. By the way, if you Google "Cassandra" and "queue," you may come across an article from over a decade ago stating that using Cassandra as a queue is an anti-pattern. While this might have been true at that time, I would argue that it is only partially true today. We made it work with ScyllaDB, which is Cassandra-compatible. To support the dispatch of events to each gameplay session, we use the session ID as the partition key so that each gameplay session has its own
Starting point is 00:06:25 partition, and events belonging to a particular gameplay session can be located by the session ID efficiently. Each event also has a unique event ID, which is a time UUID, as the clustering key. Because records within the same partition are sorted by the clustering key, the event ID can be used as the position ID in a queue. Finally, ScyllaDB clients can efficiently retrieve newly arrived events by tracking the event ID of the most recent event that has been received. There is one caveat to keep in mind when using this approach: the consistency problem. Retrieving new events by tracking the most recent event ID relies on the assumption that no event with a smaller ID will be committed in the future. However, this assumption may not always hold
Starting point is 00:07:07 true. For example, if two nodes generate two event identifiers at the same time, an event with a smaller ID might be inserted later than an event with a larger ID. This problem, which I refer to as a phantom read, is similar to the phenomenon in the SQL world where repeating the same query can yield different results due to uncommitted changes made by another transaction. However, the root cause of the problem in our case is different. It occurs when events are committed to ScyllaDB out of the order indicated by the event ID. There are several ways to address this issue. One solution is to maintain a cluster-wide status, which I call a pseudo-now, based on the smallest value of the moving timestamps among all event processors. Each event processor should also ensure that all future events
Starting point is 00:07:53 have an event ID greater than its current timestamp. Another important consideration is enabling the time window compaction strategy, which eliminates the negative performance impact caused by tombstones. Accumulation of tombstones was a major issue that prevented the use of Cassandra as a queue before the time window compaction strategy became available. Now let's shift to discussing other benefits beyond using ScyllaDB as a dispatching queue. Simplifying complex global data distribution challenges. Since we are building a multi-tenancy system to serve customers around the world, it is essential to ensure that customer configurations are consistent across clusters in different regions. Truth is, keeping a distributed system consistent is not a trivial task if you plan to do it all by yourself. We solved this
Starting point is 00:08:39 problem by simply enabling data replication on a keyspace across all data centers. This means any change made in one data center will eventually propagate to others. Thanks to ScyllaDB, as well as DynamoDB and Cassandra, for the heavy lifting that makes this challenging problem seem trivial. You might be thinking that using any typical RDBMS could achieve the same result, since most databases also support data replication. This is true if there is only one instance of the control panel running in a given region. In a typical primary-replica architecture, only the primary node supports reads and writes, while replica nodes are read-only. However, when you need to run multiple instances of the control panel across different regions, for example, every tenant has a control panel running in its home region,
Starting point is 00:09:25 or even every region has a control panel running for local teams, it becomes much more difficult to implement this using a typical primary-replica architecture. If you have used AWS DynamoDB, you may be familiar with a feature called Global Tables, which allows applications to read and write locally and access the data globally. Enabling replication on keyspaces with ScyllaDB provides a similar feature, but without vendor lock-in. You can easily extend global tables across a multi-cloud environment. Keyspaces as data containers. Next, let's look at how we use keyspaces as data containers to improve the transparency of global data distribution. Let's take a look at the diagram below. It shows a solution to a typical data distribution problem imposed by data protection laws. For example,
Starting point is 00:10:12 suppose Region A allows certain types of data to be processed outside of its borders as long as an original copy is kept in its region. As a product owner, how can you ensure that all your applications comply with this regulation? One potential solution is to perform end-to-end (E2E) tests to ensure that applications correctly send the correct data to the correct region as expected. This approach requires application developers to take full responsibility for implementing data distribution correctly. However, as the number of applications grows, it becomes impractical for each application to handle this problem individually, and E2E tests also become increasingly expensive in terms of both time and money.
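As a concrete illustration of the two ScyllaDB building blocks discussed so far, here is a CQL sketch; the keyspace, table, column, and data-center names are assumptions for illustration, not the production schema. Replication is declared once at the keyspace level with NetworkTopologyStrategy, and the per-session event table uses a time-UUID clustering key with the time window compaction strategy:

```sql
-- Illustrative only: keyspace, table, and DC names are assumptions.
-- Replication is declared once per keyspace; every table in it is
-- replicated to the listed data centers automatically.
CREATE KEYSPACE risk_events
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_region_a': 3,
    'dc_region_b': 3
  };

-- Queue-like event store: one partition per gameplay session,
-- rows ordered by the timeuuid clustering key.
CREATE TABLE risk_events.session_events (
    session_id uuid,
    event_id   timeuuid,
    payload    blob,
    PRIMARY KEY (session_id, event_id)
) WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1
  };

-- A read-side client polls for newly arrived events by tracking
-- the largest event_id it has seen so far:
SELECT session_id, event_id, payload
  FROM risk_events.session_events
 WHERE session_id = ? AND event_id > ?;
```

With this shape, auditing where a keyspace's data travels reduces to inspecting its replication map, rather than tracing every application that writes to it.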
Starting point is 00:10:53 Let's think twice about this problem. By enabling data replication on keyspaces, we can divide the responsibility for correctly distributing data into two tasks: one, identifying data types and declaring their destinations, and two, copying or moving data to the expected locations. By separating these two duties, we can abstract away complex configurations and regulations from applications. This is because the process of transferring data to another region is often the most complicated part to deal with, such as passing through network boundaries, correctly encrypting traffic, and handling interruptions. After separating these two duties, applications are only required to correctly perform the first step,
Starting point is 00:11:33 which is much easier to verify through testing at earlier stages of the development cycle. Additionally, the correctness of configurations for data distribution becomes much easier to verify and audit. You can simply check the settings of keyspaces to see where data is going. Tips for others taking a similar path. To conclude, we'll leave you with important lessons that we learned, and that we recommend you apply if you end up taking a path similar to ours. When using ScyllaDB to handle time series data, such as using it as an event dispatching queue, remember to use the time window compaction strategy. Consider using keyspaces as data containers to
Starting point is 00:12:10 separate the responsibility of data distribution. This can make complex data distribution problems much easier to manage. Watch tech talks on demand. This article is based on a tech talk presented at ScyllaDB Summit 2023. You can watch this talk, as well as talks by engineers from Discord, Epic Games, Disney, Strava, ShareChat, and more, on demand. Thank you for listening to this Hacker Noon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn and publish.
