How does Ctrip system work? I am a technical data operator of Ctrip. Let me talk about this with AWS.


Original: Li Sheng

1. Background


As Ctrip's overseas hotel business develops, the amount of data transferred between overseas suppliers all over the world and Ctrip's headquarters IDC is growing rapidly. This growth places higher demands on the bandwidth and latency of the cross-border dedicated lines.

Because cross-border line resources are currently limited, business processing efficiency and user experience have also been affected to some extent. On the cost side, a cross-border leased line is an expensive resource, and simply expanding it would greatly increase IT costs. So we began to consider whether public cloud services, combined with the business characteristics of the hotel direct connection system, could solve the bandwidth growth and latency problems.


The hotel direct connection system mainly uses automated interfaces to implement system-to-system communication between suppliers or hotel groups and Ctrip, so that static information, dynamic information, order functions, and so on all flow and interact through the systems. Ctrip currently connects a large number of overseas hotel businesses through the hotel direct connection system.


This article mainly covers Ctrip Hotel's experience migrating and deploying the direct connection service on AWS: the application architecture adjustments and cloud-native transformation, the technical and business benefits of using AWS, and the optimizations made while deploying on EKS (Amazon Elastic Kubernetes Service), such as reducing DNS query latency and the cost of traffic between Availability Zones, which are described in detail.

2. Pain point

Ctrip Hotel's overseas direct connection links thousands of overseas suppliers, and all interface access goes out through a proxy (see Figure 1). Because of the characteristics of the hotel direct connection business, when a user request arrives it is split into multiple queries according to factors such as the number of guests and whether the user is a member or a non-member. At most, one request may be split into dozens of requests. The messages are also very large, usually tens to hundreds of kilobytes. Although we may only need a small portion of the information in a returned message, the limitations of the current architecture mean that all messages must be fetched in full and then processed. This is undoubtedly a huge waste of bandwidth.

Figure 1


Because suppliers are located all over the world and all requests and responses must go out through the group proxy, responses from some supplier interfaces suffer longer latency due to physical distance, which degrades the user experience.

3. Choosing a cloud service and a preliminary plan

One of our main goals this year is to improve the network transmission capability for connecting to suppliers worldwide and to reduce latency, and so improve the user experience.


We needed a cloud service provider with resources widely distributed around the world, so that Ctrip could access data as close to suppliers as possible. After several rounds of discussion with multiple public cloud vendors, and after weighing each vendor's technical level, service capabilities, price, and many other factors, we concluded that AWS stands out in both global coverage and networking (see Figure 2). AWS has 25 regions and 80 Availability Zones worldwide, providing a wide range of service options. Its data centers are also interconnected through its own backbone network, which makes future access between data centers in different locations possible.

AWS also has clear advantages in the development and maturity of its cloud services, and in the service capabilities, response time, and professional level of its field team. In the end, we chose AWS as our cloud service provider partner for deploying resources.

Figure 2


To integrate better with resources in the cloud, we adopted the same containerized deployment approach used in the IDC. After considering the high-availability design, SLA, and community compatibility of managed container platforms, we finally chose the AWS managed container platform EKS as the deployment platform.


After the service transformation, we make extensive use of Spot Instances as EKS worker nodes, which significantly reduces costs and increases efficiency.

By taking advantage of both the network and the platform of the public cloud, we moved the relevant services, originally deployed in Ctrip's headquarters IDC, to overseas public cloud sites closer to the suppliers. This gives Ctrip a highly reliable, low-latency direct network connection to overseas suppliers. We also moved some data pre-processing logic forward to the overseas public cloud, so that only the valuable data, rather than the full raw data, is compressed and sent back to Ctrip's data center. This reduces the load on the cross-border line, improves the efficiency of business data processing, lowers cost, and improves the user experience.

4. Moving the hotel direct connection to the cloud

4.1 Cloud-native transformation of business applications

To take full advantage of the convenience and cost optimization that cloud services offer, we did research and analysis and concluded that migrating the application directly to the public cloud would deliver business value, but at a relatively high cost. We therefore made cloud-native architecture optimizations to the hotel direct connection service. The major adjustments are as follows:

1. Supplier access module in the cloud

To save bandwidth, we need to reduce the number of requests going out through the proxy and, at the same time, the message size of each request. Our approach is to move the request-splitting logic to AWS, so that each user request results in only one request and one response through the proxy.

On AWS we also remove unneeded attributes from the messages returned by suppliers, merge related nodes according to their business attributes, and finally compress the result before returning it. This achieves the goal of reducing message size (see Figure 3). Judging from current operational data, the bandwidth usage of the entire proxy is down to only 30-40%.

Figure 3
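The trimming and compression step described above can be sketched roughly as follows. This is a minimal illustration, not the actual Ctrip implementation: the field names in `KEEP_FIELDS`, the `rooms` message layout, and the choice of JSON with gzip are all assumptions made for the example.

```python
import gzip
import json

# Hypothetical set of attributes the downstream business logic needs.
KEEP_FIELDS = {"hotel_id", "room_type", "price", "currency"}


def trim_and_compress(supplier_response: dict) -> bytes:
    """Drop unneeded attributes from each room entry, then gzip the result."""
    rooms = supplier_response.get("rooms", [])
    trimmed = [
        {key: value for key, value in room.items() if key in KEEP_FIELDS}
        for room in rooms
    ]
    # Compact separators avoid spending bytes on whitespace.
    payload = json.dumps({"rooms": trimmed}, separators=(",", ":")).encode("utf-8")
    return gzip.compress(payload)


def decompress(blob: bytes) -> dict:
    """Inverse step, as it would run on the IDC side."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Only the trimmed, compressed bytes would cross the border link in this sketch; the full supplier response stays in the cloud region.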

Public cloud providers typically apply traffic-based pricing, so cost has to be considered when designing the inbound and outbound network paths. In the outbound access design, the AWS NAT gateway is used by default, but its network traffic costs are relatively high.

Hotel direct connection requests have a distinctive feature: the request message is typically less than 1 KB, while the response message averages 10 KB to 100 KB. Taking advantage of this, we switched to a self-built Squid proxy solution running on EKS (see Figure 4). This way only the outgoing request messages are subject to traffic charges, and there is no charge for the large volume of incoming response packets. As a result, the network traffic charges generated on AWS are significantly reduced.

Figure 4
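A rough way to see why the asymmetry between request and response size matters is to compare the two billing models. The sketch below assumes that a NAT gateway bills per gigabyte processed in both directions, while a self-managed proxy only incurs outbound data transfer charges. The prices passed in are hypothetical placeholders, and hourly charges and the compute cost of the proxy nodes are deliberately ignored.

```python
KB_PER_GB = 1024 * 1024


def nat_gateway_traffic_cost(request_kb, response_kb, n_requests,
                             processing_price_per_gb, egress_price_per_gb):
    """Traffic cost when both directions pass through a NAT gateway."""
    processed_gb = (request_kb + response_kb) * n_requests / KB_PER_GB
    egress_gb = request_kb * n_requests / KB_PER_GB
    return processed_gb * processing_price_per_gb + egress_gb * egress_price_per_gb


def self_managed_proxy_traffic_cost(request_kb, n_requests, egress_price_per_gb):
    """Traffic cost when only outgoing request bytes are billed."""
    egress_gb = request_kb * n_requests / KB_PER_GB
    return egress_gb * egress_price_per_gb


# Example with hypothetical prices: 1 KB requests, 50 KB responses.
n = 1024 * 1024
nat = nat_gateway_traffic_cost(1, 50, n, processing_price_per_gb=0.05,
                               egress_price_per_gb=0.10)
proxy = self_managed_proxy_traffic_cost(1, n, egress_price_per_gb=0.10)
```

With responses tens of times larger than requests, almost all of the processed volume is inbound, which is exactly the part the proxy approach avoids paying for in this model.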

2. Reduce network latency

We use AWS's global data centers to access suppliers from the nearest location. Many overseas suppliers deploy their services around the world, but all of our overseas access went out through the group proxy. For suppliers whose servers are physically far away, this meant high network latency.

With AWS data centers around the world, we can deploy our service in the data center nearest to the supplier's data center. At the same time, we use the AWS backbone to reduce latency from each of those data centers to the AWS data center near the proxy, and then connect that AWS data center to Ctrip IDC over a dedicated line (see Figure 5). For suppliers where distance had a large impact on network latency, performance has improved significantly, with response time reduced by up to 50%.

Figure 5

4.2 Continuous architecture transformation, performance and cost optimization

In the current plan, we independently developed a new set of applications for the cloud. The problem with this is that whenever the business changes, we have to modify two applications, one deployed in Ctrip IDC and one on AWS, which increases system maintenance costs. The main reason is that the original application depends heavily on Ctrip's base components. This migration uses a fully independent account and VPC network, and deploying a full set of those components in the cloud is unrealistic: first, the cost is too high; second, some sensitive data cannot be stored in the cloud.

Going forward, we will optimize the adapter architecture so that, without depending on Ctrip's base components, a single set of applications can be reused across different cloud environments.

After the service went live, in order to validate our ability to move more workloads to the cloud in the future, we have also been continuously optimizing performance, cost, and high availability.

4.2.1 Using Cloud Elastic Scalability

Let's take the cost of computing resources as an example: instance cost = instance running time × instance price. If we simply carried the operating model of an on-premises data center over to the cloud, cloud resources would cost more than the on-premises data center. So we need to make full use of the cloud's pay-as-you-go billing to reduce the cost of idle resources.

Instance running time is proportional to the number of services in the Kubernetes cluster and the computing resources allocated to them, and the number of service replicas is in turn proportional to traffic.

The hotel direct connection business has unpredictable traffic, for example when travel policies are released before holidays or during live-streaming marketing events. The elasticity of the cloud is well suited to this: it lets us handle sudden traffic with a reasonable amount of resources.

The Kubernetes HPA (Horizontal Pod Autoscaler) collects the overall load metrics of the cluster in real time, evaluates whether the scaling conditions are met, and scales pods accordingly. Scaling pods alone is not enough, so we also use the Cluster Autoscaler component. It watches the cluster for pods that cannot be scheduled because of insufficient resources, and automatically requests additional nodes from the cloud platform's instance pool. Likewise, when traffic falls, the Cluster Autoscaler detects nodes with low resource usage, reschedules their pods onto other available nodes, and reclaims the idle nodes.


Elastic scaling
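For reference, the scaling decision that the Kubernetes HPA documentation describes is, at its core, a simple ratio rule. The sketch below reproduces that rule in isolation; the default bounds are illustrative values chosen for the example, and the real controller additionally handles missing metrics, pod readiness, and stabilization windows, which are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Core HPA rule: desired = ceil(current * currentMetric / targetMetric)."""
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged,
    # which avoids flapping on small metric fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas running at 90% average CPU against a 60% target would be scaled out to 6, while the same 4 replicas at 30% would be scaled in to 2.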

The elasticity of the cloud not only helps reduce resource usage costs, it also improves the service's fault tolerance to infrastructure failures. If part of the infrastructure, such as an Availability Zone, is interrupted, an appropriate number of nodes is added in the other available zones so that the cluster as a whole remains available.

Kubernetes lets pods declare the CPU and memory they require. Finding reasonable quotas is what allows us to get maximum performance at a reasonable price. So before a service moves to the cloud, we run load tests that simulate the real environment and observe how changes in business traffic affect cluster performance: the business cycle, peak and off-peak load, and where the service hits resource bottlenecks. We leave an adequate resource buffer to handle peak traffic, so that actual usage is neither so high that it causes stability issues (for example, OOM or frequent CPU throttling) nor so low that resources are wasted. After all, even if your application uses only 1% of an instance, you still pay 100% of the instance price.
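The relationship between utilization and effective cost can be made concrete with a small calculation. This sketch simply applies the formula stated earlier (instance cost = running time × instance price) and divides by the capacity actually used; the prices and instance size in the example are hypothetical.

```python
def instance_cost(running_hours: float, hourly_price: float) -> float:
    """Instance cost = instance running time x instance price."""
    return running_hours * hourly_price


def cost_per_used_vcpu_hour(hourly_price: float, vcpus: int,
                            utilization: float) -> float:
    """Effective price of one vCPU-hour that the application really consumes."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / (vcpus * utilization)


# Hypothetical 4-vCPU instance at 0.20 per hour, running for 30 days.
monthly = instance_cost(720, 0.20)
at_half = cost_per_used_vcpu_hour(0.20, 4, 0.5)
at_one_percent = cost_per_used_vcpu_hour(0.20, 4, 0.01)
```

The bill is the same in both cases; what changes is how much useful work each unit of spend buys, which is why right-sizing requests and scaling in idle nodes both matter.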

4.2.2 Using Spot Instances in the public cloud

Some cloud platforms sell part of their idle compute resources as Spot Instances, at a lower price than On-Demand Instances. As the name of this bidding model suggests, the final instance price is determined by market supply and demand. In our experience, unless the instance type is particularly popular, the price is generally about 10-30% of the On-Demand instance price.

Inexpensive Spot Instances naturally have their limitations. The cloud platform may reclaim some instances as it adjusts the proportion of resources in the instance pool. Statistically, the overall probability of reclamation is usually <3%, and instances are notified 2 minutes before they are reclaimed. We use the Termination Handler component provided by AWS so that, after a reclamation notice, containers are rescheduled onto other available instances in advance, which reduces the impact of reclamation on the service.

The figure below shows how the cloud platform divides resource pools for Spot Instances. As we can see, even the same instance type forms an independent resource pool in each Availability Zone.

Figure 6

To minimize the impact of Spot Instance interruptions, including the impact of instance rebalancing across multiple Availability Zones, we select several different instance types and use ASGs (AWS Auto Scaling groups) to manage the different instance resource pools independently. This ensures maximum resource efficiency.

Figure 7

The Ctrip Hotel direct connection service deploys a mix of On-Demand Instances and Spot Instances to guarantee both low cost and high availability. Key system components (for example, Cluster Autoscaler) and stateful services that would lose data when interrupted (for example, Prometheus) run on On-Demand Instances. Stateless business applications, which are flexible and highly fault tolerant, run on Spot Instances. Kubernetes node affinity controls how the different types of services are scheduled onto instances carrying the corresponding type label (see Figure 8).

Figure 8
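As an illustration of the label-based placement described above, the following sketch builds the `affinity` stanza of a pod spec as a plain Python dictionary. The structure follows the standard Kubernetes node affinity schema, but the label key `node-lifecycle` and its values `spot` and `on-demand` are hypothetical; the labels actually used on the Ctrip node groups are not given in this article.

```python
def node_affinity(label_key: str, label_value: str) -> dict:
    """Build a pod-spec fragment that requires nodes carrying the given label."""
    return {
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": label_key,
                                    "operator": "In",
                                    "values": [label_value],
                                }
                            ]
                        }
                    ]
                }
            }
        }
    }


# Hypothetical label key applied to nodes by each node group.
LIFECYCLE_LABEL = "node-lifecycle"

# Stateless business applications go to Spot nodes.
stateless_app = node_affinity(LIFECYCLE_LABEL, "spot")
# Stateful or critical components (e.g. Prometheus) stay on On-Demand nodes.
prometheus = node_affinity(LIFECYCLE_LABEL, "on-demand")
```

Each fragment would be merged into the pod template of the corresponding Deployment or StatefulSet.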



By combining the native Kubernetes HPA and Cluster Autoscaler components with full use of AWS ASGs and Spot resources, we can save 50-80% on costs.
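A back-of-the-envelope model shows how a mixed fleet reaches savings in this range. It assumes a fixed fleet size and considers only the price difference between Spot and On-Demand capacity; the additional savings from autoscaling are not modeled, and the fractions used in the example are illustrative rather than measured values.

```python
def blended_saving(spot_fraction: float, spot_price_ratio: float) -> float:
    """Fraction of compute cost saved versus running everything On-Demand.

    spot_fraction:    share of capacity running on Spot Instances (0..1)
    spot_price_ratio: Spot price as a fraction of the On-Demand price (0..1)
    """
    if not (0 <= spot_fraction <= 1 and 0 <= spot_price_ratio <= 1):
        raise ValueError("both arguments must be between 0 and 1")
    return spot_fraction * (1 - spot_price_ratio)


# 80% of capacity on Spot at 30% of the On-Demand price.
conservative = blended_saving(0.8, 0.3)
# 90% of capacity on Spot at 10% of the On-Demand price.
optimistic = blended_saving(0.9, 0.1)
```

Under these assumptions the saving depends linearly on how much of the fleet can tolerate interruption, which is why moving stateless services to Spot nodes has such a large effect.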

4.2.3 Optimizing DNS resolution performance

As the service scale gradually grew, we found that call latency between services had increased significantly, reaching 1.5 s on average and 2.5 s at peak. Analysis showed that the cause was a performance bottleneck created by an excessive DNS resolution load. In the end we adopted the mainstream community approach of a node-local DNS, running a local cache for frequently resolved domain names in order to reduce the number of DNS resolution requests sent to CoreDNS and improve performance.

Figure 9

As shown in Figure 9, NodeLocal DNSCache is deployed on every node as a DaemonSet. This node-local DNS relieves the DNS query load on the CoreDNS service. The LocalDNS cache listens on each node for the DNS resolution requests of the client pods on that node, which are configured to resolve locally. The LocalDNS cache first tries to answer a request from its cache; on a miss, it goes to CoreDNS for the resolution result and caches it for the next local resolution request.
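The cache-then-fallback behaviour described above can be modeled in a few lines. This is a simplified stand-in for what a node-local DNS cache does, not the NodeLocal DNSCache implementation itself: the `upstream` callable plays the role of CoreDNS, and the fixed TTL is an assumption (a real resolver honours the TTL of each record).

```python
import time
from typing import Callable, Dict, Tuple


class LocalDnsCache:
    """Answer from a local cache first; ask the upstream resolver on a miss."""

    def __init__(self, upstream: Callable[[str], str],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._upstream = upstream          # stands in for CoreDNS
        self._ttl = ttl_seconds            # assumed fixed TTL for every entry
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, name: str) -> str:
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            return entry[0]                # cache hit, no upstream query
        address = self._upstream(name)     # cache miss or expired entry
        self._entries[name] = (address, now + self._ttl)
        return address
```

Repeated lookups of a hot service name within the TTL never reach the upstream resolver, which is the effect that takes load off CoreDNS.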

As the figures below show, with the LocalDNS scheme peak latency decreased from 2.5 s to about 300 ms, a reduction in response time of roughly 80%.

Before using LocalDNS, the response time was 1.5-2.5 seconds:

Before optimization

After adopting the LocalDNS solution, the response time fell to 300-400 ms, a latency improvement of about 80%:

After optimization

4.2.4. Optimize Public Cloud Traffic Across Availability Zones


After Spot Instances had greatly optimized our resource costs, we noticed that, once the services had scaled out significantly, traffic across Availability Zones accounted for a very high share (60%). This is because we deploy services across Availability Zones to maximize service availability. The downside is that the large volume of traffic exchanged between services incurs inter-Availability Zone traffic charges (see Figure 10).

Figure 10


For the high availability of the entire system, we do not want to deploy services in a single Availability Zone and lower the service SLA. We need to reduce traffic between Availability Zones while still guaranteeing high service availability.

After evaluating various options, we decided to expose services through AWS NLB and to disable cross-AZ load balancing on the NLB, so that traffic between upstream and downstream services is kept within the same Availability Zone. We also use the LocalDNS component mentioned earlier to pin the resolution of the upstream service's NLB domain name for each Availability Zone, which guarantees that traffic between services stays within a single Availability Zone. The result after the change is shown below:

Figure 11
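The idea of pinning name resolution per Availability Zone can be sketched as a lookup keyed by the caller's zone. An NLB exposes one address per enabled zone, so a zone-aware resolver only has to return the address in the caller's own zone. The service name, zone names, and IP addresses below are hypothetical, and the fallback to another zone when no local address exists is an assumption of this sketch rather than something stated in the article.

```python
from typing import Dict

# Hypothetical per-zone NLB addresses for one upstream service.
NLB_IPS_BY_ZONE: Dict[str, Dict[str, str]] = {
    "pricing.internal.example": {
        "zone-a": "10.0.1.10",
        "zone-b": "10.0.2.10",
        "zone-c": "10.0.3.10",
    }
}


def resolve_same_zone(service: str, client_zone: str,
                      table: Dict[str, Dict[str, str]] = NLB_IPS_BY_ZONE) -> str:
    """Return the NLB address in the caller's zone, if one exists."""
    zones = table[service]
    if client_zone in zones:
        return zones[client_zone]
    # Assumed fallback: deterministically pick another zone so the call
    # still succeeds, at the price of cross-zone traffic.
    return zones[sorted(zones)[0]]
```

With cross-zone load balancing disabled on the NLB, a request that enters through the zone-local address is only forwarded to targets in that same zone.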

On the backend, service traffic goes through the Kubernetes kube-proxy, which can forward it across Availability Zones and across nodes. We chose to set externalTrafficPolicy to Local, so that traffic is forwarded only to the service pods on the local node. However, the Local policy also brings some problems (see Figure 12).

Figure 12

As shown above, with the Local policy an uneven distribution of backend service pods can lead to traffic black holes and unbalanced service load.
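The imbalance can be illustrated with a small model. It assumes the load balancer splits traffic evenly across the nodes that host at least one pod of the service, and that each node then shares its traffic equally among its local pods; nodes without a local pod are treated as receiving no traffic. The node names and pod counts are made up for the example.

```python
from typing import Dict


def per_pod_traffic_share(pods_per_node: Dict[str, int]) -> Dict[str, float]:
    """Share of total traffic handled by each pod on every serving node."""
    serving = {node: count for node, count in pods_per_node.items() if count > 0}
    if not serving:
        raise ValueError("no node hosts a pod for this service")
    node_share = 1.0 / len(serving)  # even split across serving nodes
    return {node: node_share / count for node, count in serving.items()}


# One pod on node-a, three pods on node-b, none on node-c.
shares = per_pod_traffic_share({"node-a": 1, "node-b": 3, "node-c": 0})
```

In this example the single pod on node-a carries half of all traffic, three times the load of each pod on node-b, which is why spreading pods evenly across nodes and zones matters once the Local policy is enabled.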

To address this, we use the scaling policy of the EKS elastic scaling groups to distribute the underlying nodes evenly across the different Availability Zones. At the same time, we use the Kubernetes anti-affinity policy to spread service pods as widely as possible across nodes in different Availability Zones, which keeps traffic as balanced as possible while still guaranteeing a highly available deployment across Availability Zones. After this optimization, traffic across Availability Zones decreased by 95.4% (see Figure 13).

Figure 13

5. Further directions for optimization and improvement

Although the current architecture has resolved some issues in our business, there are still shortcomings that can be improved. To access suppliers from the nearest location, we used an independent VPC network to deploy and test our cluster, so we had to separately deploy the related storage dependencies and the logging and monitoring components in the cloud. This inevitably increases the difficulty of operations and maintenance, and the difficulty of migrating services to other clouds.


To solve this problem in the architecture design, we plan to make the following changes. First, the parts that compute in the cloud but rely on persistent data storage will be moved back to Ctrip IDC, so that this data does not need to be transferred to the cloud. Second, because the company already has a mature environment in other AWS data centers, we only need to work with OPS to connect the VPC networks between the AWS data centers. We can then use the company's logging and monitoring framework, which reduces operations and maintenance costs.


This article has shared the Ctrip Hotel direct connection team's practice of using cloud-native technologies: how to quickly build a stable and efficient production environment in the cloud for fast delivery and intelligent elasticity, along with some cost optimization in the cloud. With a cloud-native approach, infrastructure automation frees up part of the operations and maintenance work, so that more effort can go into business development and iterative business requirements can be handled more flexibly. Monitoring and logging enable rapid trial and error and fast feedback. We hope this helps more teams that are looking to move to the cloud avoid detours and take advantage of the cloud.