Author: Li Sheng
The Intelligent Accommodation Big Data Platform (HData for short) is Ctrip's accommodation data visualization platform. In short, it presents data as intuitive charts and interpretations, helping the business gain knowledge and insight so it can make the right decisions quickly and make fewer mistakes. Within the large accommodation business, each department focuses on different metrics and has different permission controls, so the way data is displayed also varies.
In terms of data volume, the platform serves roughly 2,200 unique visitors and 100,000 page views per day, and traffic doubles or triples during holidays. Since adopting ClickHouse in 2018, 90% of our business lines have come to rely heavily on it, and about 95% of interfaces respond in under 1 second, which fully demonstrates ClickHouse's query performance. Today the total data volume is about 70 billion rows, with more than 2,000 jobs running per day and roughly 15 billion rows updated daily. The total data size is 8 TB before compression and 1.75 TB after compression.
However, ClickHouse's inability to support highly concurrent queries is just as apparent. In most cases CPU usage stays below 30%, but it rises sharply when a large number of user requests arrive, and a complex, high-cost query can push CPU usage up to 40 cores in a very short time and can only be killed manually. Peak traffic typically occurs on weekday mornings.
To keep the system stable, we implemented a mechanism of proactive cache warming plus passive, user-triggered caching to reduce the load on the ClickHouse servers. On one hand, the results of frequently visited pages are cached. On the other hand, after the offline data update completes, we analyze access logs and pre-cache the default-filter data for users who have visited the relevant pages in the last 5 days, which cuts the peak load. Today this active plus passive cache absorbs most of the query volume that would otherwise go to ClickHouse, avoiding endless server expansion and smoothing out the spikes caused by bursts of concurrent requests.
Pain point
During holidays the focus shifts to real-time data. Taking this year's May Day holiday as an example, traffic on the real-time dashboards was about 10 times higher than usual. On weekdays CPU usage is typically under 30%, but during the holiday it exceeded 70% at one point, a serious hidden risk to server stability. To cope with this, on one hand we added throttling on the front end to prevent users from sending requests too frequently; on the other hand we also cache on the back end, but real-time data cannot be cached for long, with 1-2 minutes being the limit users will accept. As the figure below shows, the cache hit rate for offline data is generally high, often above 50%, while for real-time data it is only around 10%.
In addition, we enabled server-side traffic splitting: in real scenarios some users have relatively limited permissions, so the amount of data they request is small. Based on analysis we set a threshold; for example, users with fewer than 5,000 permission entries fetch their data from MySQL, where these queries are still very fast, while users with larger permission scopes query ClickHouse. This diverts a large share of the users. Although it temporarily relieved the pressure, new problems followed one after another: the data has to be written to both ClickHouse and MySQL, and consistency between the two cannot be guaranteed; keeping two copies of the data increases hardware costs; and because ClickHouse does not support standard SQL syntax, two sets of code have to be maintained, which increases development costs.
Given the problems above, our goal was a new OLAP engine that would reduce development, operations, and maintenance costs while keeping query performance in mind, and that would ideally also handle high-concurrency, high-throughput scenarios well. To that end we tried several other engines on the market, such as Ignite, CrateDB, and Kylin. Each has its own strengths in hardware cost or performance, but considering our use cases we ultimately chose StarRocks.
What is StarRocks?
StarRocks is a high-performance, distributed, relational, columnar database. It executes queries through an MPP framework, a single node can process up to 10 billion rows of data per second, and it supports both star and snowflake schemas.
A StarRocks cluster consists of FE and BE nodes and can be accessed with a MySQL client. The FE accepts connections from MySQL clients, parses and executes SQL statements, manages metadata, executes SQL DDL commands, and records databases, tables, partitions, tablet replicas, and other information in its catalog.
The BE manages tablet replicas. A tablet is a sub-table formed by partitioning and bucketing a table and is stored in columnar format. Directed by the FE, the BE creates or drops sub-tables. The BE receives the physical execution plan distributed by the FE; the FE designates one BE as the coordinator node, and under the coordinator's scheduling the other BEs work together to complete the query. The BE reads data from its local columnar storage engine and filters it quickly using indexes and predicate pushdown.
We chose StarRocks mainly for the following reasons:
- Sub-second query latency, with good performance in complex multi-dimensional analysis scenarios such as concurrent queries and multi-table joins.
- Elastic scaling: scaling out does not affect the online business, and data rebalancing completes automatically in the background.
- Cluster services are hot-standby, multi-instance deployments, so downtime, outages, and node failures do not affect the overall stability of the cluster.
- Support for materialized views and online schema changes (see the sketch below).
- Compatibility with the MySQL protocol and support for standard SQL syntax.
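As a small illustration of the materialized-view support mentioned above, the statement below is a minimal sketch of the synchronous materialized views available in StarRocks 1.x; the table and column names are hypothetical, not HData's actual schema:

-- Pre-aggregate order counts per hotel on top of a hypothetical base table.
CREATE MATERIALIZED VIEW order_cnt_by_hotel AS
SELECT hotel_id, SUM(order_cnt)
FROM realtime_daily_order
GROUP BY hotel_id;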
How does it perform? We ran a comparison. Most data in HData is queried through multi-table joins, whereas ClickHouse performs best on single tables and is weaker in multi-table join scenarios. Below, three test cases compare StarRocks and ClickHouse. For StarRocks we built a cluster of 6 virtual machines: 3 nodes deployed with both FE and BE (mixed deployment) and 3 BE-only nodes. The machine configuration is as follows:
Software version: StarRocks Standard Edition 1.16.2
The ClickHouse cluster is configured as follows:
Software version: ClickHouse 20.8
Table name - number of rows:
overseas hotel basic information - 65 million
real-time daily orders - 1 million
realtimedailyorder_lastweek - 2 million
hotel basic information - 75 million
holdroom_monitor_hotel_value - 3 million
cii_hotel_eidhotel - 45 million
Tensity_MasterHotel - 10 million
Test case 1: StarRocks 547 ms, ClickHouse 1,814 ms
Test case 2: StarRocks 126 ms, ClickHouse 142 ms
Test case 3: StarRocks 387 ms, ClickHouse 884 ms
As you can see, StarRocks' query performance is in no way inferior to ClickHouse's, and is even faster.
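The actual test SQL is not reproduced here; the query below is only a hypothetical sketch of the multi-table association pattern these cases follow, with assumed column names that do not correspond to the real test statements:

-- Hypothetical multi-table join of the kind HData typically runs.
SELECT m.masterhotelid, COUNT(o.orderid) AS order_cnt
FROM realtime_daily_order o
JOIN Tensity_MasterHotel m ON o.hotelid = m.hotelid  -- assumed join key
WHERE o.orderdate = '2021-05-01'
GROUP BY m.masterhotelid
ORDER BY order_cnt DESC
LIMIT 100;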
Data update mechanism
Based on the mapping between the loaded data and the data actually stored, StarRocks provides detail tables, aggregate tables, and update tables, corresponding to the detail model, the aggregate model, and the update model.
Detail model: the table may contain rows with duplicate keys, and the stored rows correspond one-to-one with the loaded rows, so users can retrieve all of the loaded historical data.
Aggregate model: the table contains no rows with duplicate primary keys; loaded rows with the same primary key are merged into one row, and their metric columns are combined by aggregation functions. Users can retrieve the aggregated results of all loaded historical data, but not the raw historical rows themselves.
Update model: a special case of the aggregate model in which the primary key satisfies a uniqueness constraint and the most recently loaded row replaces earlier rows with the same primary key. It is equivalent to an aggregate model whose metric columns use the REPLACE aggregation function, which returns the most recently loaded value.
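A minimal sketch of what the three models look like in DDL; the table names, columns, and distribution settings are hypothetical and only meant to illustrate the syntax, not HData's actual schemas:

-- Detail model: every loaded row is kept as-is.
CREATE TABLE order_detail (
    order_id BIGINT,
    event_time DATETIME,
    hotel_id BIGINT,
    amount DECIMAL(18, 2)
) ENGINE = OLAP
DUPLICATE KEY(order_id, event_time)
DISTRIBUTED BY HASH(order_id) BUCKETS 8;

-- Aggregate model: rows with the same key are merged and metric columns are aggregated.
CREATE TABLE order_agg (
    hotel_id BIGINT,
    order_date DATE,
    order_cnt BIGINT SUM DEFAULT "0"
) ENGINE = OLAP
AGGREGATE KEY(hotel_id, order_date)
DISTRIBUTED BY HASH(hotel_id) BUCKETS 8;

-- Update model: the most recently loaded row for a key replaces earlier ones.
CREATE TABLE order_status (
    order_id BIGINT,
    order_status INT,
    update_time DATETIME
) ENGINE = OLAP
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8;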
StarRocks provides five import methods to support different data sources (such as HDFS, Kafka, and local files) and different import modes (asynchronous or synchronous):
- Broker Load: accesses and reads external data sources through the Broker process and creates the import job in StarRocks via the MySQL protocol. It is suitable for source data that lives in a storage system accessible to the Broker process, such as HDFS.
- Spark Load: uses Spark resources to pre-process the data to be imported, improving the performance of large bulk imports and saving compute resources on the StarRocks cluster.
- Stream Load: a synchronous import method. A local file or data stream is imported into StarRocks by submitting an HTTP request, and the client waits for the system to return the import status to determine whether the import succeeded.
- Routine Load: provides a way to import data from a specified data source automatically. The user submits a routine import job via the MySQL protocol, and a resident process continuously reads data from the source (for example, Kafka) and imports it into StarRocks.
- Insert Into: similar to the INSERT statement in MySQL; data can be inserted with statements such as INSERT INTO tbl SELECT ... or INSERT INTO tbl VALUES (...).
Data in HData is mainly divided into real-time data and T+1 offline data. Real-time data is mostly imported via Routine Load, mainly into update-model tables; T+1 offline data is imported mainly through the Zeus platform via Stream Load, mostly into detail-model tables.
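As a rough sketch of how a Routine Load job consuming Kafka into an update-model table might be declared; the job, database, topic, and column names are hypothetical and the properties are illustrative, not HData's actual configuration:

-- Resident job that keeps pulling CSV messages from Kafka into order_status.
CREATE ROUTINE LOAD hdata_db.order_status_rt ON order_status
COLUMNS TERMINATED BY ",",
COLUMNS (order_id, order_status, update_time)
PROPERTIES (
    "desired_concurrent_number" = "3",
    "max_error_number" = "1000"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "order_status_topic",
    "property.group.id" = "hdata_order_rt"
);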
The real-time pipeline is built on QMQ, Ctrip's in-house message queue. The following figures show the original real-time data import process and the import process after switching to StarRocks.
We soon ran into a problem. One of our scenarios subscribes to order-status-change messages; we use the order number as the primary key and land the data in an update-model table, and externally we only expose orders whose status is not cancelled. After receiving a message we also need to call an external interface to fill in some other fields before finally landing the data. Calling the interface once per message would put too much pressure on it, so we chose a batch approach. This, however, creates a problem: Kafka cannot guarantee global message ordering, only ordering within a partition, so within the same batch, for the same order, if two messages with different order statuses end up in different partitions of the Routine Load job, there is no guarantee which one is applied first. If the cancellation message is processed first and a message with another status is applied afterwards, an order that should have been cancelled ends up as not cancelled, which affects the accuracy of the statistics.
We also considered replacing QMQ with native Kafka and using the order number as the message key to decide which partition a message is sent to, but that would require secondary development and the cost of the change was not small. To solve the problem we chose a compromise: when a message is sent, it is also written to a log table built on the detail model, which only needs to store the order number, the order status, and the time the message was sent. A scheduled task periodically sorts the rows with the same order number in this table and takes the row with the latest send time, then uses the order number to find rows in the main table whose order status does not match and updates them, compensating for the data in this way.
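A minimal sketch of the selection step such a compensation task could run, assuming a hypothetical detail-model log table order_status_log with columns order_id, order_status, and msg_time:

-- Pick the latest message per order from the log table; the scheduled task
-- then compares these statuses with the main table and fixes mismatched rows.
SELECT order_id, order_status, msg_time
FROM (
    SELECT order_id, order_status, msg_time,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY msg_time DESC) AS rn
    FROM order_status_log
) t
WHERE rn = 1;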
T+1 offline data is processed with Zeus, Ctrip's in-house data synchronization platform, which handles the ETL and the import, as shown below:
Disaster recovery and high availability
Ctrip has very high requirements for disaster recovery, and company-wide disaster recovery drills are held from time to time. StarRocks itself already provides a very good disaster recovery mechanism, and on top of it we built a high-availability architecture that suits us. The servers are deployed across two data centers and serve traffic in a 5:5 ratio. Load balancing across the externally facing FE nodes is implemented through configuration items that can be changed dynamically and take effect in real time (mainly to cover cases where servers have to be taken out of rotation manually for patching, version upgrades, and so on). Every FE and BE process is guarded by supervisor, so a process that exits unexpectedly is restarted automatically.
When an FE node fails, the surviving followers immediately elect a new master to continue providing service, but the application side cannot perceive this right away. To handle it, we set up a scheduled task that performs health checks on the FE servers at regular intervals; when a failed FE node is detected, it is immediately removed from the cluster's load-balancing list and the developers are notified by SMS. When a BE node fails, StarRocks automatically rebalances the replicas and the cluster can still serve queries; here too, a scheduled health check notifies the developers by email whenever a BE node failure is detected.
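Since StarRocks speaks the MySQL protocol, one simple way to implement such checks is to poll the cluster state over a MySQL connection. The statements below are standard StarRocks commands; how the script interprets the result (for example, checking the Alive column) is our own convention:

-- Run periodically from the health-check task over a MySQL connection.
SHOW FRONTENDS;  -- a failed FE shows up as not alive
SHOW BACKENDS;   -- a failed BE shows up as not alive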
We have also configured alarms on each server's hardware metrics through Ctrip's intelligent alerting platform, so any anomaly in CPU, memory, disk space, or other server metrics is noticed immediately and developers can intervene.
At present, 70% of the real-time data scenarios in HData have been connected to StarRocks. The average query response time is about 200 ms, requests taking more than 500 ms account for only 1% of the total, and only one copy of the data and one code base need to be maintained, which significantly reduces labour and hardware costs. Going forward, we plan to:
- migrate all remaining real-time scenarios to StarRocks;
- gradually migrate the offline scenarios as well, using StarRocks to unify OLAP analysis across the board;
- further improve our monitoring of StarRocks to make it more robust;
- separate hot and cold data by reading Hive external tables, to reduce hardware costs.