The Good Tech Companies - Inside Tencent Games’ Real-Time Event-Driven Analytics System
Episode Date: February 26, 2026
This story was originally published on HackerNoon at: https://hackernoon.com/inside-tencent-games-real-time-event-driven-analytics-system. Tencent Games built a real-tim...e CQRS analytics system with Pulsar and ScyllaDB to power global gameplay monitoring and risk control. Check more stories related to data-science at: https://hackernoon.com/c/data-science. You can also check exclusive content about #tencent-games-analytics, #cqrs-gaming-event-sourcing, #scylladb-time-series-event, #apache-pulsar-gaming-event, #scylladb-keyspace-replication, #timewindow-compaction-strategy, #real-time-data-pipeline, #good-company, and more. This story was written by: @scylladb. Learn more about this writer by checking @scylladb's about page, and for more stories, please visit hackernoon.com. Tencent Games engineered a real-time, event-driven analytics system using CQRS and event sourcing with Apache Pulsar and ScyllaDB. The platform processes massive gameplay events, dispatches time-series data efficiently, and replicates keyspaces globally for compliance and multi-region control. The result is scalable fan-in/fan-out event streaming that powers anti-cheating and risk detection across millions of players worldwide.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Inside Tencent Games' Real-Time Event-Driven Analytics System, by ScyllaDB.
A look at how Tencent Games built service architecture based on CQRS and event sourcing patterns
with Pulsar and ScyllaDB.
As a part of Tencent Interactive Entertainment Group Global (IEG Global), Proxima Beta is committed
to supporting our teams and studios to bring unique, exhilarating games to millions of players
around the world. You might be familiar with some of our current games, such as PUBG Mobile, Arena of
Valor, and Tower of Fantasy. Our team at Level Infinite, the brand for global publishing, is responsible
for managing a wide range of risks to our business, for example, cheating activities and harmful content.
From a technical perspective, this required us to build an efficient real-time analytics system to
consistently monitor all kinds of activities in our business domain. In this blog, we share our experience of
building this real-time event-driven analytics system. First, we'll explore why we built our
service architecture based on Command and Query Responsibility Segregation (CQRS) and event sourcing
patterns with Apache Pulsar and ScyllaDB. Next, we'll look at how we use ScyllaDB to solve the
problem of dispatching events to numerous gameplay sessions. Finally, we'll cover how we use ScyllaDB
keyspaces and data replication to simplify our global data management.
A peek at the use case: addressing risks in Tencent Games.
Let's start with a real-world example of what we're working with
and the challenges we face. This is a screenshot from Tower of Fantasy, a 3D action role-playing game.
Players can use this dialogue to file a report against another player for various reasons.
If you were to use a typical CRUD system for it, how would you keep those records for follow-ups?
And what are the potential problems? The first challenge would be determining which team is going to own the database to store this form.
There are different reasons to make a report, including an option called "Others," so a case might be
handled by different functional teams. However, there is not a single functional team in our
organization that can fully own the form. That's why it is a natural choice for us to capture
this case as an event, such as "report a case." All the information is captured in this event as is.
All functional teams only need to subscribe to this event and do their own filtering. If they think
the case falls into their domain, they can just capture it and trigger further actions.
CQRS and event sourcing.
The service architecture behind this example is based on the CQRS and event
sourcing patterns. If these terms are new to you, don't worry. By the end of this overview,
you should have a solid understanding of these concepts. And if you want more detail at that point,
take a look at our blog dedicated to this topic. The first concept to understand here is event
sourcing. The core idea behind event sourcing is that every change to a system's state is captured
in an event object and these event objects are stored in the order in which they were applied to the
system state. In other words, instead of just storing the current state, we use an append-only
store to record the entire series of actions taken on that state. This concept is simple but powerful:
because every action is recorded as an event, any possible model describing
the system can be built from the events. The next concept is CQRS,
which stands for Command Query Responsibility Segregation. CQRS was coined by Greg Young over a decade
ago and originated from the command-query separation principle. The fundamental idea is to create
separate data models for reads and writes rather than using the same model for both purposes.
By following the CQRS pattern, every API should either be a command that performs an action
or a query that returns data to the caller, but not both. This naturally divides the system into two parts:
the write side and the read side. This separation offers several benefits. For example,
we can scale write and read capacity independently to optimize cost efficiency. From a teamwork
perspective, different teams can create different views of the same data with fewer conflicts.
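To make the two patterns concrete, here is a minimal in-memory sketch in Python. It is an illustration under stated assumptions, not the system described in this story: the `EventStore` class, the `report_filed` event kind, and all field names are hypothetical. It shows an append-only log, one command that only records an event, and two queries that build different views from the same events.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """An immutable fact. Field names are illustrative assumptions."""
    seq: int    # position in the append-only log
    kind: str   # for example "report_filed"
    data: dict


@dataclass
class EventStore:
    """Append-only store: events are added, never updated or deleted."""
    _log: List[Event] = field(default_factory=list)

    def append(self, kind: str, data: dict) -> Event:
        event = Event(seq=len(self._log), kind=kind, data=data)
        self._log.append(event)
        return event

    def replay(self) -> List[Event]:
        # Return a copy so callers cannot mutate the log.
        return list(self._log)


# Write side: a command performs an action and returns no data.
def file_report(store: EventStore, reporter: str, target: str, reason: str) -> None:
    store.append("report_filed",
                 {"reporter": reporter, "target": target, "reason": reason})


# Read side: each team derives its own view by replaying the same events.
def reports_by_reason(store: EventStore, reason: str) -> List[dict]:
    return [e.data for e in store.replay()
            if e.kind == "report_filed" and e.data["reason"] == reason]


def report_count_per_target(store: EventStore) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for e in store.replay():
        if e.kind == "report_filed":
            counts[e.data["target"]] = counts.get(e.data["target"], 0) + 1
    return counts


store = EventStore()
file_report(store, "alice", "mallory", "cheating")
file_report(store, "bob", "mallory", "others")
file_report(store, "carol", "trent", "cheating")
```

Because the views are computed from the log rather than stored in a shared table, a team can add a new view later without asking any other team to change its own.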
The high-level workflow of the write side can be summarized as follows. Events that occur
in numerous gameplay sessions are fed into a limited number of event processors. The implementation is
also straightforward, typically involving a message bus such as Pulsar, Kafka, or a simpler queue
system that acts as an event store. Events from clients are persisted in the event store by topic,
and event processors consume events by subscribing to topics. If you're interested in why we chose Apache
Pulsar over other systems, you can find more information in the blog referenced earlier.
Although queue-like systems are usually efficient at handling traffic that flows in one direction
(e.g., fan-in), they may not be as effective at handling traffic that flows in the opposite direction
(e.g., fan-out). In our scenario, the number of gameplay sessions will be large, and a typical
queue system doesn't fit well since we can't afford to create a dedicated queue for every
gameplay session. We need to find a practical way to distribute findings and metrics to individual
gameplay sessions through query APIs. This is why we use ScyllaDB to build another queue-like
event store, which is optimized
for event fan-out. We will discuss this further in the next section. Before we move on,
here's a summary of our service architecture. Starting from the write side, game servers keep
sending events to our system through command endpoints, and each event represents a certain
kind of activity that occurred in a gameplay session. Event processors produce findings or metrics
against the event streams of each gameplay session and act as a bridge between the two sides. On the
read side, we have game servers or other clients that keep pulling metrics and findings
through query endpoints and take further actions if abnormal activities have been observed.
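The flow summarized above can be sketched end to end in a few lines of Python. Everything here is a stand-in chosen for illustration: a `deque` plays the role of the message bus topic, a dictionary plays the role of the per-session findings store, and the "too many shots" rule and its threshold are invented, not a rule from the real system.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List

# Write side: one shared topic receives events from every session (fan-in).
topic: Deque[dict] = deque()

# Read side: findings are kept per session so each session can query its own.
findings: Dict[str, List[str]] = defaultdict(list)

# Hypothetical detection rule: more than this many "shot" events is abnormal.
MAX_SHOTS_PER_SESSION = 3


def command_report_event(session_id: str, kind: str) -> None:
    """Command endpoint: records an event and returns nothing."""
    topic.append({"session_id": session_id, "kind": kind})


def run_event_processor() -> None:
    """Drains the topic and bridges the write side to the read side."""
    shots: Dict[str, int] = defaultdict(int)
    while topic:
        event = topic.popleft()
        if event["kind"] != "shot":
            continue
        session_id = event["session_id"]
        shots[session_id] += 1
        # Emit the finding once, at the moment the threshold is crossed.
        if shots[session_id] == MAX_SHOTS_PER_SESSION + 1:
            findings[session_id].append("abnormal_fire_rate")


def query_findings(session_id: str) -> List[str]:
    """Query endpoint: returns data and changes nothing."""
    return list(findings.get(session_id, []))


for _ in range(5):
    command_report_event("s1", "shot")
command_report_event("s2", "shot")
run_event_processor()
```

Note that the game server calling `query_findings` never touches the topic, and the game server calling `command_report_event` never reads findings, which is the separation the architecture relies on.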
Distributed queue-like event store for time-series events.
Now let's look at how we use ScyllaDB
to solve the problem of dispatching events to numerous gameplay sessions. By the way, if you Google
"Cassandra" and "queue," you may come across an article from over a decade ago stating that using
Cassandra as a queue is an anti-pattern. While this might have been true at that time, I would argue
that it is only partially true today. We made it work
with ScyllaDB, which is Cassandra-compatible. To support the dispatch of events to each gameplay
session, we use the session ID as the partition key so that each gameplay session has its own
partition and events belonging to a particular gameplay session can be located by the session
ID efficiently. Each event also has a unique event ID, which is a time UUID, as the clustering key.
Because records within the same partition are sorted by the clustering key, the event ID can be
used as the position ID in a queue. Finally, ScyllaDB clients can efficiently retrieve newly
arrived events by tracking the event ID of the most recent event that has been received.
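The data model just described can be sketched as follows. The CQL strings show what such a table and polling query might look like; the table name, column names, and payload type are assumptions, since the story does not give the actual schema. The Python class is an in-memory model of the same idea, with plain integers standing in for time UUIDs, so that the behavior can be tried without a cluster.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical CQL for the table described above (all names are assumptions).
SCHEMA_CQL = """
CREATE TABLE events (
    session_id text,
    event_id   timeuuid,
    payload    text,
    PRIMARY KEY ((session_id), event_id)
) WITH CLUSTERING ORDER BY (event_id ASC);
"""

# Hypothetical polling query: fetch events newer than the last one received.
POLL_CQL = (
    "SELECT event_id, payload FROM events "
    "WHERE session_id = ? AND event_id > ?"
)

Row = Tuple[int, str]  # (event_id, payload); an int stands in for a time UUID


class SessionEventStore:
    """In-memory model: one sorted partition per gameplay session."""

    def __init__(self) -> None:
        self._partitions: Dict[str, List[Row]] = defaultdict(list)

    def insert(self, session_id: str, event_id: int, payload: str) -> None:
        # Rows stay sorted by event_id, like rows sorted by a clustering key.
        bisect.insort(self._partitions[session_id], (event_id, payload))

    def poll(self, session_id: str, after: Optional[int] = None) -> List[Row]:
        """Return events newer than `after`, the last event ID the client saw."""
        rows = self._partitions.get(session_id, [])
        if after is None:
            return list(rows)
        return [row for row in rows if row[0] > after]


events = SessionEventStore()
events.insert("s1", 10, "a")
events.insert("s1", 30, "c")
events.insert("s1", 20, "b")   # arrives late but is still stored in order
events.insert("s2", 5, "x")
```

A client for session `s1` that last saw event 10 would call `events.poll("s1", after=10)` and then remember the largest event ID in the result as its new position.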
There is one caveat to keep in mind when using this approach: the consistency problem.
Retrieving new events by tracking the most recent event ID relies on the assumption that no event
with a smaller ID will be committed in the future. However, this assumption may not always hold
true. For example, if two nodes generate two event identifiers at the same time, an event with a
smaller ID might be inserted later than an event with a larger ID. This problem, which I refer to as a
phantom read, is similar to the phenomenon in the SQL world where repeating the same query
can yield different results due to changes made concurrently by another transaction. However, the root cause of the
problem in our case is different. It occurs when events are committed to ScyllaDB out of the order
indicated by the event ID. There are several ways to address this issue. One solution is to maintain
a cluster-wide status, which I call a pseudo-now, based on the smallest value of the moving
timestamps among all event processors. Each event processor should also ensure that all future events
have an event ID greater than its current timestamp. Another important consideration is enabling
the Time Window Compaction Strategy (TWCS), which eliminates the negative performance impact caused by tombstones.
Accumulation of tombstones was a major issue that prevented the use of Cassandra as a queue before TWCS became available.
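Returning to the pseudo-now idea, the sketch below shows one way it could work. It is an interpretation rather than the actual implementation: the story does not say how readers use the pseudo-now, so this example assumes that a reader only consumes events whose ID is at or below it. The `PseudoNow` class, the `safe_poll` function, and the integer timestamps are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str]  # (event_id, payload); ints stand in for time UUIDs


class PseudoNow:
    """Cluster-wide status: the smallest moving timestamp of all processors."""

    def __init__(self, processor_ids: Iterable[str]) -> None:
        self._clock: Dict[str, int] = {pid: 0 for pid in processor_ids}

    def advance(self, processor_id: str, timestamp: int) -> None:
        # By advancing, a processor promises that every event it commits
        # from now on will have an ID greater than `timestamp`.
        if timestamp < self._clock[processor_id]:
            raise ValueError("a processor's timestamp must never move backwards")
        self._clock[processor_id] = timestamp

    def value(self) -> int:
        return min(self._clock.values())


def safe_poll(rows: List[Row], cursor: int, pseudo_now: int) -> List[Row]:
    """Assumed reader rule: only read events with cursor < ID <= pseudo-now."""
    return [row for row in sorted(rows) if cursor < row[0] <= pseudo_now]


clock = PseudoNow(["A", "B"])
partition: List[Row] = []

clock.advance("B", 100)              # B may still commit IDs above 100
partition.append((105, "from A"))    # A commits event 105 first
clock.advance("A", 105)

# Pseudo-now is min(105, 100) = 100, so event 105 is held back for now.
first = safe_poll(partition, cursor=100, pseudo_now=clock.value())

partition.append((102, "from B"))    # the smaller ID arrives later
clock.advance("B", 110)

# Pseudo-now is min(105, 110) = 105, so both events are released in order.
second = safe_poll(partition, cursor=100, pseudo_now=clock.value())
```

A reader that had simply taken everything newer than its cursor would have received event 105 first, moved its cursor to 105, and never seen event 102. The price of the pseudo-now is some added latency, since readers wait for the slowest processor.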
Now let's shift to discussing other benefits beyond using ScyllaDB as a dispatching queue.
Simplifying complex global data distribution challenges.
Since we are building a multi-tenancy system to serve customers around the world, it is essential to ensure that customer configurations are consistent across clusters in different regions.
Trust me, keeping a distributed
system consistent is not a trivial task if you plan to do it all by yourself. We solved this
problem by simply enabling data replication on a keyspace across all data centers. This means any
change made in one data center will eventually propagate to others. Thanks to ScyllaDB, as well as DynamoDB
and Cassandra, for the heavy lifting that makes this challenging problem seem trivial. You might be thinking that
using any typical RDBMS could achieve the same result since most databases also support data
replication. This is true if there is only one instance of the control panel running in a given
region. In a typical primary/replica architecture, only the primary node supports reads and writes, while
replica nodes are read-only. However, when you need to run multiple instances of the control panel
across different regions (for example, every tenant has a control panel running in its home region,
or even every region has a control panel running for local teams), it becomes much more difficult to
implement this using a typical primary/replica architecture. If you have used AWS DynamoDB,
you may be familiar with a feature called Global Tables, which allows applications to read and write
locally and access the data globally. Enabling replication on keyspaces with ScyllaDB provides a
similar feature, but without vendor lock-in. You can easily extend global tables across a multi-cloud
environment.
Keyspaces as data containers.
Next, let's look at how we use keyspaces as data containers to improve
the transparency of global data distribution. Let's take a look at the diagram below. It shows
a solution to a typical data distribution problem imposed by data protection laws. For example,
suppose Region A allows certain types of data to be processed outside of its borders as long as an
original copy is kept in its region. As a product owner, how can you ensure that all your
applications comply with this regulation? One potential solution is to perform end-to-end
(E2E) tests to ensure that applications send the correct data to the correct region as expected.
This approach requires application developers to take full responsibility for implementing data
distribution correctly. However, as the number of applications grows,
it becomes impractical for each application to handle this problem individually and E2E tests
also become increasingly expensive in terms of both time and money.
Let's think twice about this problem. By enabling data replication on keyspaces,
we can divide the responsibility for correctly distributing data into two tasks:
one, identifying data types and declaring their destinations,
and two, copying or moving data to the expected locations.
By separating these two duties, we can abstract away complex configurations and regulations from applications.
This is because the process of transferring data to another region is often the most complicated part to deal with,
such as passing through network boundaries, correctly encrypting traffic, and handling interruptions.
After separating these two duties, applications are only required to correctly perform the first step,
which is much easier to verify through testing at earlier stages of the development cycle.
Additionally, the correctness of configurations for data distribution becomes much easier to verify and audit.
You can simply check the settings of keyspaces to see where data is going.
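As a rough illustration of this split, the Python sketch below keeps the two duties in separate places. The keyspace names, data center names, data types, and replication factors are all invented for the example. The generated statement uses the standard `NetworkTopologyStrategy` replication class, where each listed data center receives the given number of replicas.

```python
from typing import Dict, List

# Operator's duty: each keyspace declares which data centers hold its data.
# Names and replication factors are hypothetical.
KEYSPACES: Dict[str, Dict[str, int]] = {
    "region_a_local": {"dc_region_a": 3},
    "region_a_shared": {"dc_region_a": 3, "dc_region_b": 3},
}

# Application's duty: declare which keyspace each data type belongs to.
DATA_TYPE_TO_KEYSPACE: Dict[str, str] = {
    "player_identity": "region_a_local",
    "gameplay_metrics": "region_a_shared",
}


def create_keyspace_cql(name: str, replication: Dict[str, int]) -> str:
    """Build a CREATE KEYSPACE statement from a data center to RF mapping."""
    options = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(replication.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def destinations(data_type: str) -> List[str]:
    """Audit helper: list the data centers that receive a given data type."""
    return sorted(KEYSPACES[DATA_TYPE_TO_KEYSPACE[data_type]])


statements = [create_keyspace_cql(name, rep)
              for name, rep in sorted(KEYSPACES.items())]
```

With this layout, an audit reduces to reading two small mappings: `destinations("player_identity")` shows that identity data stays in Region A, while gameplay metrics are also copied to Region B. The application only has to pick the right keyspace.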
Tips for others taking a similar path.
To conclude, we'll leave you with important lessons that we learned,
and that we recommend you apply if you end up taking a path similar to ours.
When using ScyllaDB to handle time-series data, such as using it as an event-dispatching queue,
remember to use the Time Window Compaction Strategy. Consider using keyspaces as data containers to
separate the responsibility of data distribution. This can make complex data distribution problems
much easier to manage.
Watch tech talks on demand. This article is based on a tech talk presented
at ScyllaDB Summit 2023. You can watch this talk, as well as talks by engineers from Discord, Epic Games,
Disney, Strava, ShareChat and more, on demand.
Thank you for listening to
this HackerNoon story, read by artificial intelligence. Visit hackernoon.com to read, write, learn and
publish.
