Original article by Li Sheng

About the author: Li Sheng graduated from Guangxi University of Finance and Economics and works at Trip.com Group (Ctrip) as a system and data engineer.

This article mainly explains how the delayed-message middleware evolved toward an elastic architecture, together with the network and performance optimizations made along the way.
1. Background
QMQ's delayed message capability is provided as an independent service; it is not constrained by the existing cluster and is a solution implemented by the message middleware team. Its architecture is shown in the figure below:

QMQ delayed message service architecture
After a delayed message is delivered by the producer to the delay service, it is stored on the server's local disk. When the message's scheduled time arrives, the delay service delivers it to the broker in real time for consumers to consume.

The delay service uses a master-slave architecture, where a zone represents an availability zone (which can generally be understood as an IDC). To guarantee that, after a single-zone failure, historical messages that have already been delivered but not yet scheduled can still be scheduled normally, the master and slave are deployed in different availability zones.
1.1 Pain points

This architecture mainly has the following problems:

1. The service is stateful and cannot scale elastically.
2. After a failure of the master node, a master-slave switchover is required, either automatic or manual.
3. There is no consistency coordinator to guarantee data consistency.

If the business layer of the message service is separated from the storage layer, the two can evolve independently and each focus on what it is good at. The messaging business layer can then be stateless and, with a simple containerization, scale elastically, while the storage layer is a distributed file storage service whose high availability and data consistency are guaranteed by the storage service itself.
1.2 Choosing a distributed file storage

When selecting a storage service, beyond the basic requirements of data high availability and strong consistency, another important consideration is high fault tolerance with low O&M cost. A defining characteristic of a distributed system is its tolerance of partial node failures; after all, hardware and software failures of some kind are 100% unavoidable. High fault tolerance and low O&M cost were therefore the most important factors in our choice.

In 2016, Pulsar was open-sourced by Yahoo and contributed to Apache. Billed as a cloud-native, low-latency distributed messaging and streaming platform, it caused quite a stir in the open source community and drew a lot of attention. After some research, I found that Pulsar also separates the message business layer from the storage layer, and that its storage layer is another Apache open source project: BookKeeper.
2. BookKeeper

BookKeeper is a scalable, fault-tolerant, low-latency and strongly consistent storage service. A number of companies already use it in production; best practices include storing the HDFS NameNode journal, storing and consuming Pulsar messages, and persisting data for object storage.
2.1 Basic architecture

The basic architecture of BookKeeper is as follows. The ZooKeeper cluster is used for storage node discovery and metadata storage and provides strong consistency guarantees. Bookies are the storage nodes that provide the storage service; during writes and reads, bookies do not need to interact with each other. When a bookie starts, it registers itself with the ZooKeeper cluster to expose its service. The client is a thick client: it communicates directly with both the ZooKeeper cluster and the BookKeeper cluster and, based on the metadata, completes multi-replica writes and makes sure the data can be read back.
2.2 Main functions

1. Basic concepts

- Entry: the basic unit of data, the data carrier.
- Ledger: a collection of entries, similar to a file.
- Bookie: stores fragments of ledgers; the physical storage node.
- Ensemble: a set of bookies.
2. Reading and writing data

Reading and writing data in BookKeeper works as follows. A BookKeeper client creates a ledger and, while holding it, can perform entry write operations; the written entries are distributed across the ensemble of bookies. Entries are numbered on the client side, and each entry is acknowledged only after it has been written to the configured number of replicas (Qw). A BookKeeper client can also open an already-created ledger and perform read operations; entries are read in the same order in which they were written. Reads go to the first replica by default and are retried on the next replica after a read failure.
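For reference, here is a minimal sketch of this write/read flow against the classic LedgerHandle API of BookKeeper 4.x; the ZooKeeper address, password and payload are placeholders, not values from our deployment:

import java.util.Enumeration;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class BkReadWriteSketch {
    public static void main(String[] args) throws Exception {
        byte[] password = "demo-password".getBytes();
        // Connect via the ZooKeeper cluster used for bookie discovery and metadata.
        BookKeeper bk = new BookKeeper("zk1:2181,zk2:2181,zk3:2181");

        // Create a ledger: ensemble size 3, write quorum (Qw) 3, ack quorum 2.
        LedgerHandle writer = bk.createLedger(3, 3, 2, BookKeeper.DigestType.CRC32, password);
        writer.addEntry("hello bookkeeper".getBytes()); // entry ids are assigned in write order
        writer.close(); // closing seals the ledger and persists its final LAC in metadata

        // Re-open the ledger for reading; read order matches write order.
        LedgerHandle reader = bk.openLedger(writer.getId(), BookKeeper.DigestType.CRC32, password);
        Enumeration<LedgerEntry> entries = reader.readEntries(0, reader.getLastAddConfirmed());
        while (entries.hasMoreElements()) {
            System.out.println(new String(entries.nextElement().getEntry()));
        }
        reader.close();
        bk.close();
    }
}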
3. Data consistency

The BookKeeper client that holds and can write to a ledger is called the Writer. A distributed lock mechanism ensures that a ledger has only one Writer in the whole world; the Writer is globally unique.

With Writer uniqueness guaranteed, data is written as follows: the Writer maintains the LAC (LastAddConfirmed) in memory, and once an entry satisfies the Qw requirement, the LAC is updated. The LAC is piggybacked on subsequent requests, or periodically, and stored on the bookies; when the ledger is closed, it is saved to the metadata store (ZooKeeper or etcd).

The BookKeeper client that holds and reads a ledger is called a Reader; a ledger can have any number of Readers. The strong consistency of the LAC means that different Readers see a single view of the data, and repeated reads are consistent as well; this guarantees read consistency.
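As an illustration of how a Reader uses the LAC as its upper bound, here is a minimal sketch; the ledgerId and password are hypothetical, and the bk client is assumed to be constructed as in the earlier sketch:

import java.util.Enumeration;
import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerEntry;
import org.apache.bookkeeper.client.LedgerHandle;

public class LacReadSketch {
    // Read only entries up to the LastAddConfirmed, so the Reader never sees
    // data that has not yet met the Qw acknowledgement requirement.
    static void readUpToLac(BookKeeper bk, long ledgerId, byte[] password) throws Exception {
        LedgerHandle reader = bk.openLedgerNoRecovery(ledgerId, BookKeeper.DigestType.CRC32, password);
        long lac = reader.readLastConfirmed(); // latest LAC visible on the bookies
        if (lac >= 0) {
            Enumeration<LedgerEntry> entries = reader.readEntries(0, lac);
            while (entries.hasMoreElements()) {
                byte[] payload = entries.nextElement().getEntry();
                // process payload ...
            }
        }
        reader.close();
    }
}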
4. Fault tolerance

Typical failure scenarios are a Writer crash or restart and a bookie crash.

If the Writer fails, the ledger cannot be closed and the LAC is unknown; the ledger recovery mechanism closes the ledger and recovers the LAC.

If a bookie fails, writing entries to it fails. The ensemble replacement mechanism updates the routing information for new entries in the metadata store, ensuring that new data can still be written successfully in time. For historical data, the bookie recovery mechanism restores the Qw replica count, so the read reliability of historical data does not degrade. If all the bookies holding all replicas of a piece of data fail, the only option is to wait for them to be repaired.
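Ledger recovery is triggered implicitly when a client opens a ledger in recovery mode; a minimal sketch, with hypothetical ledgerId and password and a bk client as in the earlier sketches:

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.LedgerHandle;

public class LedgerRecoverySketch {
    // Opening a ledger in recovery mode fences the old Writer, recovers the LAC
    // from the bookies and seals the ledger, so its end becomes well defined.
    static long recoverAndClose(BookKeeper bk, long ledgerId, byte[] password) throws Exception {
        LedgerHandle recovered = bk.openLedger(ledgerId, BookKeeper.DigestType.CRC32, password);
        long lastEntryId = recovered.getLastAddConfirmed();
        recovered.close();
        return lastEntryId;
    }
}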
5. Load balancing

When new bookies are added to the cluster, traffic is automatically rebalanced onto them as new ledgers are created.
2.3 Same-city multi-datacenter disaster recovery

The Shanghai region contains multiple availability zones (AZ), and the network latency between any two of them is less than 2ms. Under this network architecture, scattering the replicas across different AZs yields a high availability solution. BookKeeper's zone-aware ensemble placement policy is exactly the solution for this scenario, as shown below:

Same-city multi-datacenter disaster recovery based on the zone-aware policy

The zone-aware placement policy has two constraints:

a. E % Qw == 0
b. Qw > minNumOfZones

where E is the ensemble size, Qw is the number of replicas (write quorum), and minNumOfZones is the minimum number of zones in an ensemble.

For example, with minNumOfZones = 2, desiredNumOfZones = 3, E = 6 and Qw = 3, the ensemble is [z1, z2, z3, z1, z2, z3]. Before any failure, each piece of data has three replicas distributed across three availability zones. When z1 fails, a new ensemble is created to satisfy the minNumOfZones constraint: [z1, z2, z3, z1, z2, z3] -> [z3, z2, z3, z3, z2, z3]. Obviously each piece of data still has three replicas spanning two availability zones, so another zone failure can still be tolerated.
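A minimal client configuration sketch for this policy, assuming BookKeeper 4.10+ where ZoneawareEnsemblePlacementPolicy is available; the ZooKeeper address is a placeholder, and the two quorum-zone setter names reflect my understanding of the client API, so verify them against your version:

import org.apache.bookkeeper.client.BookKeeper;
import org.apache.bookkeeper.client.ZoneawareEnsemblePlacementPolicy;
import org.apache.bookkeeper.conf.ClientConfiguration;

public class ZoneAwareClientSketch {
    public static void main(String[] args) throws Exception {
        ClientConfiguration conf = new ClientConfiguration();
        conf.setZkServers("zk1:2181,zk2:2181,zk3:2181");                    // placeholder address
        conf.setEnsemblePlacementPolicy(ZoneawareEnsemblePlacementPolicy.class);
        conf.setMinNumZonesPerWriteQuorum(2);                               // minNumOfZones in the text
        conf.setDesiredNumZonesPerWriteQuorum(3);                           // desiredNumOfZones in the text
        BookKeeper bk = new BookKeeper(conf);
        // Ledgers created from this client must satisfy E % Qw == 0 and Qw > minNumOfZones,
        // e.g. E = 6, Qw = 3 as in the example above.
        bk.close();
    }
}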
DNSResolver

When the client chooses bookies to form an ensemble, it needs to reverse-resolve each bookie's IP address to obtain the corresponding zone information, which requires the user to implement a resolver. Since the network segments of different zones are planned in advance and do not overlap, in our deployment I simply implemented a dynamically configurable subnet resolver. The example below shows an implementation based on exact IP matching:
import java.util.List;
import java.util.Map;

import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
import org.apache.bookkeeper.net.AbstractDNSToSwitchMapping;

public class ConfigurableDNSToSwitchMapping extends AbstractDNSToSwitchMapping {

    private final Map<String, String> mappings = Maps.newHashMap();

    public ConfigurableDNSToSwitchMapping() {
        super();
        // Static examples; in practice the mappings are refreshed from dynamic configuration.
        mappings.put("192.168.0.1", "/z1/192.168.0.1"); // format: /zone/upgrade-domain
        mappings.put("192.168.1.1", "/z2/192.168.1.1");
        mappings.put("192.168.2.1", "/z3/192.168.2.1");
    }

    @Override
    public boolean useHostName() {
        return false; // bookies are resolved by IP address, not host name
    }

    @Override
    public List<String> resolve(List<String> names) {
        List<String> rNames = Lists.newArrayList();
        names.forEach(name -> {
            String rName = mappings.getOrDefault(name, "/default-zone/default-upgradedomain");
            rNames.add(rName);
        });
        return rNames; // return the resolved names, not the input list
    }
}
Custom DNS resolver example
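To plug the resolver in, the client configuration points the placement policy at the custom class (assumed to be on the classpath). The property key below, reppDnsResolverClass, is the one I believe the rack/zone-aware policies read; treat the exact key as an assumption to verify against your BookKeeper version:

import org.apache.bookkeeper.conf.ClientConfiguration;

public class ResolverRegistrationSketch {
    public static void main(String[] args) {
        ClientConfiguration conf = new ClientConfiguration();
        // Tell the zone-aware placement policy which DNSToSwitchMapping implementation to load.
        conf.setProperty("reppDnsResolverClass", ConfigurableDNSToSwitchMapping.class.getName());
    }
}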
When, for some reason (for example, an availability zone failure drill), only one availability zone is available, all replicas of newly written data end up in that single availability zone. After the failed availability zone recovers, this means some historical data exists in only one availability zone, which does not meet the high availability requirement of multi-zone disaster recovery. The auto-recovery mechanism can discover placement policy violations, but it lacks a corresponding repair mechanism, so I made a patch that supports dynamically turning this feature on and off. With it, after an availability zone fails and then recovers, the affected data can be automatically detected and re-replicated, avoiding the situation where all replicas sit in a single availability zone and data availability is affected.
3. Implementing the elastic architecture

With BookKeeper in place, the delayed messaging service architecture becomes quite elegant: the message business layer is completely separated from the storage layer, the delayed messaging service itself is stateless and can be scaled easily, and a master-slave switchover is no longer required when an availability zone fails. The new architecture is shown below:

New delayed messaging service architecture
3.1 Stateless transformation

After the storage layer is split out, it becomes possible to make the business layer stateless, but a few problems still need to be solved to get there. First, let's look at some restrictions on using BookKeeper:

1. BookKeeper does not support shared writes. That is, if multiple business-layer nodes write data, they must write to different ledgers.
2. Although BookKeeper allows multiple readers, if multiple application nodes read independently, their reading progress is independent of each other, and the application has to resolve duplicate scheduling on its own.

These two problems mean that, when implementing stateless elastic scaling, we have to solve the allocation of read/write resources ourselves. For this purpose I introduced a task coordinator.

First, storage is managed in segments. Each segment supports reads and writes, but at any given time only one business-layer node may read and write it. If we treat segments as resources and business-layer nodes as workers, the main responsibilities of the task coordinator are:
1. Distribute the resources to the workers as evenly as possible, in priority order (a minimal sketch of this follows the list).
2. Monitor changes in resources and workers; when either scales up or down, re-execute responsibility 1.
3. When resources are insufficient, add and initialize new resources according to the configured policy.
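The sketch below illustrates responsibility 1 with a plain round-robin assignment of segments (resources) to business-layer nodes (workers), ignoring priorities for brevity; all names are hypothetical and the real coordinator's algorithm may differ:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinAssignerSketch {
    // Spread segments over workers as evenly as possible (round-robin).
    static Map<String, List<String>> assign(List<String> segments, List<String> workers) {
        Map<String, List<String>> plan = new HashMap<>();
        workers.forEach(w -> plan.put(w, new ArrayList<>()));
        for (int i = 0; i < segments.size(); i++) {
            String worker = workers.get(i % workers.size());
            plan.get(worker).add(segments.get(i));
        }
        return plan;
    }
}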
Because this is a distributed environment, the coordinator itself must fulfil the above responsibilities while guaranteeing distributed consistency and, of course, meeting the availability requirements. I chose a one-master, many-slaves architecture based on ZooKeeper, as shown below:
As shown in the figure, a coordinator is deployed as a peer inside every business-layer application node. At runtime, the coordinators use ZooKeeper's leader election mechanism to determine the leader node, and the leader node is responsible for the task assignment described above, as shown below:

Coordinator

The coordinator's leader election implementation follows the recipe in the official ZooKeeper documentation and is not detailed here.
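For illustration only, here is a minimal leader election sketch using Apache Curator's LeaderLatch recipe on top of ZooKeeper; this library choice is an assumption for the example, not necessarily what the coordinator actually uses, and the address and path are placeholders:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CoordinatorElectionSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper address and election path.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Every business-layer node runs one latch; ZooKeeper elects exactly one leader.
        LeaderLatch latch = new LeaderLatch(client, "/delay-service/coordinator-leader", "node-1");
        latch.start();
        latch.await(); // blocks until this node becomes the leader

        // Only the leader reaches this point; it performs the task assignment
        // (responsibilities 1-3 above), and a follower takes over if it dies.
    }
}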
3.2 Data persistence

In the initial architecture, delayed messages were stored locally in segments according to their scheduled time, one bucket per 10 minutes. Buckets close to the current time are loaded into memory, and a HashedWheelTimer is used for scheduling.
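Roughly, scheduling a due message with a hashed wheel timer looks like the sketch below. I am assuming Netty's io.netty.util.HashedWheelTimer here, which is a common choice, although the article does not name the exact implementation, and the delivery call is stubbed out:

import java.util.concurrent.TimeUnit;
import io.netty.util.HashedWheelTimer;

public class WheelTimerSketch {
    public static void main(String[] args) {
        HashedWheelTimer timer = new HashedWheelTimer(); // 100ms tick by default
        // Once the delay expires, hand the message to the real-time broker (stubbed here).
        timer.newTimeout(timeout -> System.out.println("deliver message to broker"),
                30, TimeUnit.SECONDS);
    }
}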
This scheme has two drawbacks:

1. Too many segments. To support delays of up to 2 years, the theoretical number of segments exceeds 100,000.
2. If the data of a single 10-minute segment cannot all be loaded into memory, then, because the data inside a segment is not sorted by scheduled time, the part that has not been loaded yet may contain data whose scheduled time is earlier, which stalls the scheduling.
As for drawback 1: 100,000+ local files on a single machine is not a problem, but after the transformation this segment information becomes metadata that is stored in ZooKeeper. In my implementation plan, each segment occupies at least 3 ZooKeeper nodes. Suppose I want to deploy 5 clusters, each with an average of 10 subjects and 100,000 segments per subject: the number of ZooKeeper nodes required would be 5 × 10 × 100,000 × 3 = 15 million, which is unbearable for ZooKeeper.
Drawback 2 is a potential problem regardless of the old or new architecture: once a given 10-minute bucket contains a large number of messages, loading it into memory may delay message scheduling, so a finer loading granularity is needed.
Based on the analysis of the problems above, I borrowed the idea of hierarchical timing wheels, made some small changes, and designed a multi-level scheduling scheme based on sliding time buckets, as shown below:
Scheduler name | Barrier (minimum scheduled time) | Segment size | Normal number of segments
L3w            | T + 14 days                      | 1 week       | 104
L2d            | T + 2 days                       | 1 day        | 14
L1h            | T + 2 hours                      | 1 hour       | 48
L0m            | T + 0                            | 1 minute     | 120
As shown in the table above, the largest segment size is 1 week, followed by 1 day, 1 hour and 1 minute. Each level spans a different time range; accumulated over a 2-year range, only 104 + 14 + 48 + 120 = 286 segments are needed in theory, a qualitative reduction compared with the original 100,000+ buckets. At the same time, only the L0m level scheduler needs to load data into the HashedWheelTimer, so the loading granularity is reduced to 1 minute, which substantially reduces the scheduling delays caused by not being able to load a whole segment.
The schedulers at the different levels work together in a cascade. When the scheduler at a given level receives a write request, it first tries to delegate the request to the coarser-grained scheduler above it. If the upper level accepts, it simply returns the upper level's result; if the upper level does not accept, it checks whether the message belongs to its own level: if so, it puts it into the corresponding bucket, otherwise it hands it down to the finer-grained scheduler below.
The scheduler at each level also opens the segment whose time is about to arrive and sends its data down to the next-level scheduler. For example, when L1h finds that its earliest bucket is due for preloading, it reads the data in that bucket and sends it to the L0m scheduler; in the end, the data for that hour is moved to L0m and spread across up to 60 one-minute segments.
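As a concrete illustration of how a message's delay selects a level, here is a minimal routing sketch using the barriers from the table above; the class and method names are hypothetical, and the real implementation cascades between level instances rather than using a single function:

import java.time.Duration;

public class MultiLevelRoutingSketch {
    // Barriers from the table: L3w >= T+14d, L2d >= T+2d, L1h >= T+2h, L0m >= T+0.
    enum Level {
        L3W(Duration.ofDays(14)), L2D(Duration.ofDays(2)), L1H(Duration.ofHours(2)), L0M(Duration.ZERO);
        final Duration barrier;
        Level(Duration barrier) { this.barrier = barrier; }
    }

    // Pick the coarsest level whose barrier the delay still exceeds.
    static Level route(Duration delay) {
        for (Level level : Level.values()) { // declared from coarsest to finest
            if (delay.compareTo(level.barrier) >= 0) {
                return level;
            }
        }
        return Level.L0M;
    }

    public static void main(String[] args) {
        System.out.println(route(Duration.ofDays(30)));    // L3W: stored in 1-week buckets
        System.out.println(route(Duration.ofMinutes(90))); // L0M: loaded into the HashedWheelTimer
    }
}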
4. Plans for the future

The current bookie cluster is deployed on physical machines, which makes creating, expanding and shrinking a cluster relatively troublesome; in the future we will consider integrating it into the k8s system. Platformizing bookie management also needs to be considered. At present we only have same-city multi-datacenter disaster recovery; high-availability architectures such as cross-region disaster recovery and disaster recovery across a hybrid public/private cloud also need to be strengthened further.