Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

Original: Li Sheng

Original Profile

Li Sheng


Graduated from: Guangxi Institute of Finance and Economics




System Data Technician

Ctrip.Internet.System Data Technician

This article is mostly


Middleware development and evolution of flexible architecture

Simultaneously optimize network and performance

1. Background

QMQ system latency message

Independence as a Service

Existing set is not limited

Solutions implemented by message providers

Its structure is shown in figure below

In the following way:

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

I'll explain

QMQ Delayed Message Service Architecture

Delay message

After delivery from manufacturer to service with a delay

Store on server's local drive

When scheduling messages expire

Service Delay

To broker in real time for consumer consumption

Delay service uses master-slave architecture


Zone represents an availability zone

In general, this can be understood as IDC


Guaranteed after single zone failure

Pending scheduling of historical delivery

Typical message scheduling

Master and slave will be deployed in different Availability Zones


1.1 Painful points

This architecture mainly has following problems

1. The service has a state

Unable to flexibly expand and contract

2. After failure of main node

Master-slave switching required. Automatic or manual

3. No consistency coordinator to ensure data consistency

If business layer of message

Separate from storage layer

Appropriate evolution and coordinated development

Focus on areas they are good at


Messaging business layer can be stateless

Simple container transformation

Elastic expansion and contraction options

The storage tier represents distributed file storage services

High availability and data consistency guaranteed by storage services

1.2 Choosing a Distributed File Storage


To select a storage service

Beyond basic high availability data

External Consistency

Another important point

High fault tolerance and low O&M costs

The main feature of a distributed system

Nature is tolerance for partial node failure

In end

Any hardware or software failure

This is 100% inevitable


High fault tolerance and low O&M costs

This will be most important thing in our choice


Pulsar contributed to Apache from Yahoo Open Source

Because it was created in cloud

Low latency distributed messaging

Tags for Queues and Streaming Platforms

In open source community

Invoke sensation and stalking

After appropriate research

I found Pulsar


Separation of message business and storage architecture

Storage level

Another BookKeeper from Apache Open Source Foundation

Two. Accountant


As scalable

High fault tolerance

Low z distributiondelay

Strongly Consistent Storage Service Disabled

Some companies use it for production deployment

Best practices

Enable namenode instead of HDFS

Storing and consuming Pulsar messages

Object persistence and storage

2.1 Basic architecture

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

Basic architecture of BookKeeper

Zookeeper cluster is used for

Discovery of storage nodes and storage of meta-information

Provide strong consistency guarantee

Storage node of a bookmaker

Provide storage services


And while reading

Bookmaker nodes should not interact with each other

When does bookmaker start

Log in to Zookeeper cluster. Grant access to service

Client belongs

Thick client type

Responsible for communicating with Zookeeper

Direct communication between cluster and BookKeeper cluster

And according to meta information

Finish multiple copies. Make sure data can be reread

2.2 Main Functions

1. Basic concepts


Basic unit of data carrier

General ledger

Excerpt from collection of articles. Similar to file


A brief overview of registry. Physical storage nodes


collection of bookmakers

2. Reading and writing data

As it should

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

Reading and writing BookKeeper data

betting client

Hold ledger while creating it

You can later

Perform record write operation

feed entry

Distributed by enemble bookmaker

The record is numbered on client side

Each entry will be based on set number of copies

.Qw. Record success request

betting client

Opening already

The created workbook performs a read record operation

The order in which a record is read corresponds to order in which it is written

Reading from first replica by default

Retry from next replica after read failure

3. Data Consistency


A betting client that can write ledgers is called Writer


Distributed lock mechanism secures account

There is only one Writer in world

Writer uniqueness guaranteed

Data writing sequence

Maintain LAC in Writer memory

Last addition confirmed

Subject to Qw requirements

Update LAC


With next request or time

Save at bookmaker

When book is disabled

Save to metadata store

zookeeper or etcd.


A betting client that can read account books

Invoked as a reader

Account book

There can be any number of readers

LAC Strong Consistency

Different readers see a single view of data

Also re-read

This ensures data read consistency

4. Fault tolerance

Typical failure scenarios

Writer crashes or restarts, Bookie crashes

Write failed

The ledger cannot be closed

results in unknown LAC

Via ledger recovery mechanism

Close book

Restore LAC

Bookmaker error

Error writing entry

Via ensemble replacement mechanism


New Entry Route

Information in metadata store

Ensure timely successful recording of new data

Historical data

Via bookmaker recovery mechanism

Respect Qw copy requirements

Reduced reliability of historical data reading


Everything where copy is

All bookmaker nodes fail. It remains only to wait for repair

5. Load balancing

New extension

Bookmaker in a cluster

Traffic is automatically balanced when a new book is created

2.3 Disaster Recovery in Multiple City Centers

In Shanghai area


There are multiple availability zones

az, accessible zone

Two or two networks available for each

Delay less than 2ms

In this network architecture

Multiple copies scattered around

Other az allowed

High Availability Solution


Replacing an Ensemble with a Zone

The strategy is solution to this scenario

In following way:

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

Disaster recovery in multiple centers within a city based on a zone-aware strategy

Open area

The perceptual strategy has two limitations:

a.E % Qw == 0; b.

Qw > minNumOfZones


E means ensemble size

Qw means number of copies

minimum number of zones

Represents minimum number of zones in an ensemble


minNumOfZones = 2
desired number of zones = 3
Kv = 3
[z1, z2, z3, z1, z2, z3]

To failure

Each piece of data has three copies

and distributed across three availability zones

On z1 failure


minNumOfZones limit


New Ensemble

z1, z2, z3, z1, z2, z3

-> [z3, z2, z3, z3, z2, z3]


For each piece of data in triplicate

Will still span across two Availability Zones

Can still allow zone to fail



When choosing a bookmaker to form an ensemble

Need to reverse solution via ip

Display relevant zone information

Requires user to implement a parser


Network segment between zones

Considered scheduled and non-overlapping


When I landed

A simple implementation

Subnet converter that can be configured dynamically

The example shows an implementation of exact IP address matching

In following way:

public class ConfigurableDNSToSwitchMapping extends AbstractDNSToSwitchMapping {

    private final Map
    mappings = Maps.newHashMap();

    public ConfigurableDNSToSwitchMapping() {
        Mapping.put("", "/z1/"); // /zone/update domain
        Mapping.put("", "/z2/");
        Mapping.put("", "/z3/");

    public boolean useHostName() {
        return false;

    public List
      names) {
       rNames = Lists.newArrayList();
        names.forEach(name -> {
            String rName = mappings.getOrDefault(name, "/default-zone/default-upgradedomain");
        return names;


Custom DNS resolver example

Data copies are distributed in one zone

When for some reason

For example, drill down into an availability zone failure

as a result

When only one availability zone is available

New recorded data

All replicas will be in same Availability Zone

When recovering a failed Availability Zone


Some historical data only exists in one Availability Zone

Does not meet requirementsHigh Availability for Multi-Zone Disaster Recovery

Automatic recovery mechanism

There is a PlacementPolicy discovery mechanism

but missing

Recovery mechanism

So I made a patch

Dynamic engine support

Turning this feature on and off


Availability Zone Failover and Recovery

Data can be automatically detected and recovered

All copies are distributed in one print run

Availability zone affects data availability

3. Implementing Agile Architecture

After getting to know BookKeeper

Delayed messaging service architecture is relatively beautiful

Business layer message

Completely separate from storage layer

The deferred messaging service itself is stateless

Can be stretched easily

If an availability zone fails, a master-slave switchover is no longer required

In following way:

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

New Delayed Messaging Architecture

3.1 Stateless transformation

After separating storage layer

It is possible to reach stateless business layer

To achieve this goal

Some problems still to be solved

I'm telling you

Let's look at some restrictions on using BookKeeper

1. Accountant

Co-writing is not supported

That is, if multiple nodes in business layer are writing data

They must be different ledgers

2. Although BookKeeper

Allow more reading

However, if multiple application nodes are read separately

Progress is independent of each other

Apps should resolve scheduling issues on their own

The above two main problems

Choose us

When implementing stateless elastic expansion and contraction


Solve read/write resource allocation problem yourself

For this

I introduced task coordinator

I'm first

Managing storage segments

Read and write operations are supported per segment

But at same time

There can only be one business layer node for reading and writing


I treat shards as resources

Think of business layer nodes as workflows

The main responsibility of task coordinator is then

1. As far as possible

At medium level

Distribute resources among performers in order of priority

2. Monitoring

Changes in resources and workers

Enlargement or reduction is possible. Re-execution of duty 1

3. When resources are not enough

According to specific policy configuration

Add and initialize new resources

Because it's a distributed environment

Coordinator completes work on his own

The above responsibilities are necessary for distributed consistency

Of course, usability requirements must also be met

I chose

Based on ZooKeeper

Architecture "one master-many slaves"

In the following way:

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

As shown

The coordinator is deployed in a peer-to-peer network in a business layer application node


The coordinator is based on

ZooKeeper leader selection mechanism

Define Lead Node

And master node is responsible for above task assignment

In following way:

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you


Election Implementation Handbook

Official ZooKeeper documentation. No details here

3.2 Persistent data

Initial structure

Message will be delayed

According to schedule

Saves locally in segments every 10 minutes

Buckets near time are loaded into memory


HashedWheelTimer for scheduling

This scheme has two drawbacks

More segments

Supports delay up to 2 years

Theoretical number of segments is greater than 100,000

Data in one segment

10 minutes

If not everything can be loaded into memory

Then because

Segments are not sorted by planning time

may appear

The unloaded part contains planning time

Old data

Loading stuck

If disadvantage is 1

100,000+ local files on a single computer - no problem

But after transformation

This segment information is presented as meta-information

Save to ZooKeeper

My implementation plan

Solved for each segment

Occupy at least 3 ZooKeeper nodes


I want to deploy 5 clusters

On average, there are 10 segments in each cluster

100,000 segments per segment

ZooKeeper node to use

Number from 15 million

This is unbearable for ZooKeeper

Disadvantage 2 - independent of old and new architecture

This is all a potential problem

One time

There are more messages in specified 10 minutes

This may delay message

When loaded into memory

There should be more precise detail

Based on analysis of above problems

You mean layered time

The idea of ​​cyclic planning

Small changes

Designed set

Multilevel scheduling scheme based on sliding time division

In following way:

Scheduler name

Barrier (minimum planning time)

One segment size

Normal number of segments


T + 14 days

1 week



T + 2 days

1 day



T + 2 hours

1 hour


L0 m


1 minute


As shown in table above

Largest segment - 1 week

After 1 day

1 hour 1 minute


Levels span different time frames

Cumulated over a period of 2 years

Theoretically, a range requires a total of 286 segments

Compared to original volume of more than 100,000 barrels, it has been qualitatively reduced


Level L0m only

Required by scheduler

Load data into HashedWheelTimer

So, loading

Detail uvelimited to 1 minute

Substantially reduced due to impossibility

Scheduling delay due to full segment loading

Multilevel schedulers work together in a cascading fashion

Each level

When scaler receives a write request

First, try delegating authority

Great detail

Scheduler Processing

If boss accepts

Then just

Return result of processing of parent

If boss does not accept

Then evaluate if it belongs to this level

If yes, put it in trash, otherwise send it back to subordinate

Each level


Segment with closing by time

Open and send data to next level scheduler

For example

L1h finds smallest bucket

It's time to preload

Then put data in basket

Read and send to scheduler L0m


Data for this hour

Moved to L0m and extended to

Up to 60 minute segments

In following way:

Why messages are delayed in Ctrip BookKeeper How system works I'll tell you

4. Planning for future

Current bookmaker

The cluster is deployed on a physical machine

Creating a cluster

Scaling and shrinking are relatively problematic

In future, we will consider possibility of integration into k8s system

Betting shop management

Platformization also needs to be considered

Currently, there is only same multi-site disaster recovery capability in city


Disaster recovery and hybrid public/private

High availability architectures such as cloud-based disaster recovery also need further strengthening