Compaction for Shared-data Clusters
This topic describes how to manage compaction in StarRocks shared-data clusters.
Overview
Each data loading operation in StarRocks generates a new version of data files. Compaction merges data files from different versions into larger files, reducing the number of small files and improving query efficiency.
Compaction Score
Overview
The Compaction Score reflects the merging status of data files in a partition. A higher score indicates lower merging progress, meaning the partition has more unmerged data file versions. FE maintains Compaction Score information for each partition, including the Max Compaction Score (the highest score among all tablets in the partition).
If a partition's Max Compaction Score is below the FE parameter lake_compaction_score_selector_min_score (default: 10), compaction for that partition is considered complete. A Max Compaction Score exceeding 100 indicates an unhealthy compaction state. When the score exceeds the FE parameter lake_ingest_slowdown_threshold (default: 100), the system slows down data loading transaction commits for the partition. If it surpasses lake_compaction_score_upper_bound (default: 2000), the system rejects loading transactions for the partition.
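You can inspect these thresholds with ADMIN SHOW FRONTEND CONFIG and, assuming they are mutable in your version, adjust them at runtime with ADMIN SET FRONTEND CONFIG. A minimal sketch (the value below is illustrative only; tune it to your workload):
-- View compaction- and ingest-related FE parameters.
ADMIN SHOW FRONTEND CONFIG LIKE "%lake_compaction%";
ADMIN SHOW FRONTEND CONFIG LIKE "%lake_ingest%";
-- Illustrative only: raise the loading slowdown threshold.
ADMIN SET FRONTEND CONFIG ("lake_ingest_slowdown_threshold" = "200");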
Calculation Rules
Typically, each data file contributes 1 to the Compaction Score. For example, if a partition has one tablet and 10 data files generated from the first loading operation, the partition’s Max Compaction Score is 10. All data files generated by a transaction within a tablet are grouped as a Rowset.
During score calculation, a tablet’s Rowsets are grouped by size, and the group with the highest number of files determines the tablet’s Compaction Score.
For example, a tablet undergoes 7 loading operations, generating Rowsets with sizes: 100 MB, 100 MB, 100 MB, 10 MB, 10 MB, 10 MB, and 10 MB. During calculation, the system groups the three 100 MB Rowsets into one group and the four 10 MB Rowsets into another. The Compaction Score is calculated based on the group with more files; in this case, the second group has the higher score (4 files versus 3). Compaction prioritizes the higher-scoring group, so after the first compaction, the Rowset distribution becomes: 100 MB, 100 MB, 100 MB, and 40 MB.
Compaction Workflow
For shared-data clusters, StarRocks introduces a new FE-controlled compaction mechanism:
- Score Calculation: The Leader FE node calculates and stores Compaction Scores for partitions based on transaction publish results.
- Candidate Selection: FE selects partitions with the highest Max Compaction Scores as compaction candidates.
- Task Generation: FE initiates compaction transactions for selected partitions, generates tablet-level subtasks, and dispatches them to Compute Nodes (CNs) until reaching the limit set by the FE parameter lake_compaction_max_tasks.
- Subtask Execution: CNs execute compaction subtasks in the background. The number of concurrent subtasks per CN is controlled by the CN parameter compact_threads.
- Result Collection: FE aggregates subtask results and commits the compaction transaction.
- Publish: FE publishes the successfully committed compaction transaction.
Manage compaction
View compaction scores
- You can view the compaction scores of partitions in a specific table by using the SHOW PARTITIONS or SHOW PROC statement. Typically, you only need to focus on the MaxCS field. If MaxCS is below 10, compaction is considered complete. If MaxCS is above 100, the Compaction Score is relatively high. If MaxCS exceeds 500, the Compaction Score is very high and manual intervention may be required.
SHOW PARTITIONS FROM <table_name>
-- Or:
SHOW PROC '/dbs/<database_name>/<table_name>/partitions'
Example:
mysql> SHOW PROC '/dbs/load_benchmark/store_sales/partitions';
+-------------+---------------+----------------+----------------+-------------+--------+--------------+-------+------------------------------+---------+----------+-----------+----------+------------+-------+-------+-------+
| PartitionId | PartitionName | CompactVersion | VisibleVersion | NextVersion | State | PartitionKey | Range | DistributionKey | Buckets | DataSize | RowCount | CacheTTL | AsyncWrite | AvgCS | P50CS | MaxCS |
+-------------+---------------+----------------+----------------+-------------+--------+--------------+-------+------------------------------+---------+----------+-----------+----------+------------+-------+-------+-------+
| 38028 | store_sales | 913 | 921 | 923 | NORMAL | | | ss_item_sk, ss_ticket_number | 64 | 15.6GB | 273857126 | 2592000 | false | 10.00 | 10.00 | 10.00 |
+-------------+---------------+----------------+----------------+-------------+--------+--------------+-------+------------------------------+---------+----------+-----------+----------+------------+-------+-------+-------+
1 row in set (0.20 sec)
- You can also view the partition compaction scores by querying the system-defined view information_schema.partitions_meta.
Example:
mysql> SELECT * FROM information_schema.partitions_meta ORDER BY Max_CS LIMIT 10;
+--------------+----------------------------+----------------------------+--------------+-----------------+-----------------+----------------------+--------------+---------------+-----------------+-----------------------------------------+---------+-----------------+----------------+---------------------+-----------------------------+--------------+---------+-----------+------------+------------------+----------+--------+--------+-------------------------------------------------------------------+
| DB_NAME | TABLE_NAME | PARTITION_NAME | PARTITION_ID | COMPACT_VERSION | VISIBLE_VERSION | VISIBLE_VERSION_TIME | NEXT_VERSION | PARTITION_KEY | PARTITION_VALUE | DISTRIBUTION_KEY | BUCKETS | REPLICATION_NUM | STORAGE_MEDIUM | COOLDOWN_TIME | LAST_CONSISTENCY_CHECK_TIME | IS_IN_MEMORY | IS_TEMP | DATA_SIZE | ROW_COUNT | ENABLE_DATACACHE | AVG_CS | P50_CS | MAX_CS | STORAGE_PATH |
+--------------+----------------------------+----------------------------+--------------+-----------------+-----------------+----------------------+--------------+---------------+-----------------+-----------------------------------------+---------+-----------------+----------------+---------------------+-----------------------------+--------------+---------+-----------+------------+------------------+----------+--------+--------+-------------------------------------------------------------------+
| tpcds_1t | call_center | call_center | 11905 | 0 | 2 | 2024-03-17 08:30:47 | 3 | | | cc_call_center_sk | 1 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 12.3KB | 42 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11906/11905 |
| tpcds_1t | web_returns | web_returns | 12030 | 3 | 3 | 2024-03-17 08:40:48 | 4 | | | wr_item_sk, wr_order_number | 16 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 3.5GB | 71997522 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/12031/12030 |
| tpcds_1t | warehouse | warehouse | 11847 | 0 | 2 | 2024-03-17 08:30:47 | 3 | | | w_warehouse_sk | 1 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 4.2KB | 20 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11848/11847 |
| tpcds_1t | ship_mode | ship_mode | 11851 | 0 | 2 | 2024-03-17 08:30:47 | 3 | | | sm_ship_mode_sk | 1 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 1.7KB | 20 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11852/11851 |
| tpcds_1t | customer_address | customer_address | 11790 | 0 | 2 | 2024-03-17 08:32:19 | 3 | | | ca_address_sk | 16 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 120.9MB | 6000000 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11791/11790 |
| tpcds_1t | time_dim | time_dim | 11855 | 0 | 2 | 2024-03-17 08:30:48 | 3 | | | t_time_sk | 16 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 864.7KB | 86400 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11856/11855 |
| tpcds_1t | web_sales | web_sales | 12049 | 3 | 3 | 2024-03-17 10:14:20 | 4 | | | ws_item_sk, ws_order_number | 128 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 47.7GB | 720000376 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/12050/12049 |
| tpcds_1t | store | store | 11901 | 0 | 2 | 2024-03-17 08:30:47 | 3 | | | s_store_sk | 1 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 95.6KB | 1002 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11902/11901 |
| tpcds_1t | web_site | web_site | 11928 | 0 | 2 | 2024-03-17 08:30:47 | 3 | | | web_site_sk | 1 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 13.4KB | 54 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11929/11928 |
| tpcds_1t | household_demographics | household_demographics | 11932 | 0 | 2 | 2024-03-17 08:30:47 | 3 | | | hd_demo_sk | 1 | 1 | HDD | 9999-12-31 23:59:59 | NULL | 0 | 0 | 2.1KB | 7200 | 0 | 0 | 0 | 0 | s3://XXX/536a3c77-52c3-485a-8217-781734a970b1/db10328/11933/11932 |
+--------------+----------------------------+----------------------------+--------------+-----------------+-----------------+----------------------+--------------+---------------+-----------------+-----------------------------------------+---------+-----------------+----------------+---------------------+-----------------------------+--------------+---------+-----------+------------+------------------+----------+--------+--------+-------------------------------------------------------------------+
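The example above sorts in ascending order. To surface the partitions most in need of compaction instead, you can sort by MAX_CS in descending order and project only the relevant columns:
SELECT DB_NAME, TABLE_NAME, PARTITION_NAME, MAX_CS
FROM information_schema.partitions_meta
ORDER BY MAX_CS DESC LIMIT 10;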
View compaction tasks
As new data is loaded into the system, FE constantly schedules compaction tasks to be executed on different CN nodes. You can first view the general status of compaction tasks on FE, and then view the execution details of each task on CN.
View general status of compaction tasks
You can view the general status of compaction tasks using the SHOW PROC statement.
SHOW PROC '/compactions';
Example:
mysql> SHOW PROC '/compactions';
+---------------------+-------+---------------------+---------------------+---------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Partition | TxnID | StartTime | CommitTime | FinishTime | Error | Profile |
+---------------------+-------+---------------------+---------------------+---------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ssb.lineorder.10081 | 15 | 2026-01-10 03:29:07 | 2026-01-10 03:29:11 | 2026-01-10 03:29:12 | NULL | {"sub_task_count":12,"read_local_sec":0,"read_local_mb":218,"read_remote_sec":0,"read_remote_mb":0,"read_segment_count":120,"write_segment_count":12,"write_segment_mb":219,"write_remote_sec":4,"in_queue_sec":18} |
| ssb.lineorder.10068 | 16 | 2026-01-10 03:29:07 | 2026-01-10 03:29:13 | 2026-01-10 03:29:14 | NULL | {"sub_task_count":12,"read_local_sec":0,"read_local_mb":218,"read_remote_sec":0,"read_remote_mb":0,"read_segment_count":120,"write_segment_count":12,"write_segment_mb":218,"write_remote_sec":4,"in_queue_sec":38} |
| ssb.lineorder.10055 | 20 | 2026-01-10 03:29:11 | 2026-01-10 03:29:15 | 2026-01-10 03:29:17 | NULL | {"sub_task_count":12,"read_local_sec":0,"read_local_mb":218,"read_remote_sec":0,"read_remote_mb":0,"read_segment_count":120,"write_segment_count":12,"write_segment_mb":218,"write_remote_sec":4,"in_queue_sec":23} |
+---------------------+-------+---------------------+---------------------+---------------------+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The following fields are returned:
- Partition: The partition to which the compaction task belongs.
- TxnID: The transaction ID assigned to the compaction task.
- StartTime: The time when the compaction task starts. NULL indicates that the task has not yet been initiated.
- CommitTime: The time when the compaction task commits the data. NULL indicates that the data has not yet been committed.
- FinishTime: The time when the compaction task publishes the data. NULL indicates that the data has not yet been published.
- Error: The error message (if any) of the compaction task.
- Profile: (Supported from v3.2.12 and v3.3.4 onwards) The profile of the compaction task after it finishes.
  - sub_task_count: The number of sub-tasks (equivalent to tablets) in the partition.
  - read_local_sec: The total time all sub-tasks spend reading data from the local cache. Unit: Seconds.
  - read_local_mb: The total size of data read from the local cache by all sub-tasks. Unit: MB.
  - read_remote_sec: The total time all sub-tasks spend reading data from the remote storage. Unit: Seconds.
  - read_remote_mb: The total size of data read from the remote storage by all sub-tasks. Unit: MB.
  - read_segment_count: The total number of files read by all sub-tasks.
  - write_segment_count: The total number of new files generated by all sub-tasks.
  - write_segment_mb: The total size of new files generated by all sub-tasks. Unit: MB.
  - write_remote_sec: The total time all sub-tasks spend writing data to the remote storage. Unit: Seconds.
  - in_queue_sec: The total time all sub-tasks stay in the queue. Unit: Seconds.
View execution details of compaction tasks
Each compaction task is divided into multiple sub-tasks, each of which corresponds to a tablet. You can view the execution details of each sub-task by querying the system-defined view information_schema.be_cloud_native_compactions.
Example:
mysql> SELECT * FROM information_schema.be_cloud_native_compactions;
+-------+--------+-----------+---------+---------+------+---------------------+-------------+----------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| BE_ID | TXN_ID | TABLET_ID | VERSION | SKIPPED | RUNS | START_TIME | FINISH_TIME | PROGRESS | STATUS | PROFILE |
+-------+--------+-----------+---------+---------+------+---------------------+-------------+----------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10001 | 51047 | 43034 | 12 | 0 | 1 | 2024-09-24 19:15:15 | NULL | 82 | | {"read_local_sec":0,"read_local_mb":31,"read_remote_sec":0,"read_remote_mb":0,"read_remote_count":0,"read_local_count":1900,"segment_init_sec":0,"column_iterator_init_sec":0,"in_queue_sec":0} |
| 10001 | 51048 | 43032 | 12 | 0 | 1 | 2024-09-24 19:15:15 | NULL | 82 | | {"read_local_sec":0,"read_local_mb":32,"read_remote_sec":0,"read_remote_mb":0,"read_remote_count":0,"read_local_count":1900,"segment_init_sec":0,"column_iterator_init_sec":0,"in_queue_sec":0} |
| 10001 | 51049 | 43033 | 12 | 0 | 1 | 2024-09-24 19:15:15 | NULL | 82 | | {"read_local_sec":0,"read_local_mb":31,"read_remote_sec":0,"read_remote_mb":0,"read_remote_count":0,"read_local_count":1900,"segment_init_sec":0,"column_iterator_init_sec":0,"in_queue_sec":0} |
| 10001 | 51051 | 43038 | 9 | 0 | 1 | 2024-09-24 19:15:15 | NULL | 84 | | {"read_local_sec":0,"read_local_mb":31,"read_remote_sec":0,"read_remote_mb":0,"read_remote_count":0,"read_local_count":1900,"segment_init_sec":0,"column_iterator_init_sec":0,"in_queue_sec":0} |
| 10001 | 51052 | 43036 | 12 | 0 | 0 | NULL | NULL | 0 | | |
| 10001 | 51053 | 43035 | 12 | 0 | 1 | 2024-09-24 19:15:16 | NULL | 2 | | {"read_local_sec":0,"read_local_mb":1,"read_remote_sec":0,"read_remote_mb":0,"read_remote_count":0,"read_local_count":100,"segment_init_sec":0,"column_iterator_init_sec":0,"in_queue_sec":0} |
+-------+--------+-----------+---------+---------+------+---------------------+-------------+----------+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The following fields are returned:
- BE_ID: The ID of the CN.
- TXN_ID: The ID of the transaction to which the sub-task belongs.
- TABLET_ID: The ID of the tablet to which the sub-task belongs.
- VERSION: The version of the tablet.
- RUNS: The number of times the sub-task has been executed.
- START_TIME: The time when the sub-task starts.
- FINISH_TIME: The time when the sub-task finishes.
- PROGRESS: The compaction progress of the tablet in percentage.
- STATUS: The status of the sub-task. Error messages are returned in this field if an error occurs.
- PROFILE: (Supported from v3.2.12 and v3.3.4 onwards) The runtime profile of the sub-task.
  - read_local_sec: The time the sub-task spends reading data from the local cache. Unit: Seconds.
  - read_local_mb: The size of data read from the local cache by the sub-task. Unit: MB.
  - read_remote_sec: The time the sub-task spends reading data from the remote storage. Unit: Seconds.
  - read_remote_mb: The size of data read from the remote storage by the sub-task. Unit: MB.
  - read_local_count: The number of times the sub-task reads data from the local cache.
  - read_remote_count: The number of times the sub-task reads data from the remote storage.
  - in_queue_sec: The time the sub-task stays in the queue. Unit: Seconds.
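Because be_cloud_native_compactions is a regular system view, you can filter it like any other table. For example, to list only the sub-tasks that are still running:
SELECT BE_ID, TXN_ID, TABLET_ID, PROGRESS, STATUS
FROM information_schema.be_cloud_native_compactions
WHERE START_TIME IS NOT NULL AND FINISH_TIME IS NULL;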
Configure compaction tasks
You can configure compaction tasks using the following FE and CN (BE) parameters.
FE parameters
You can configure the following FE parameters dynamically.
ADMIN SET FRONTEND CONFIG ("lake_compaction_max_tasks" = "-1");
lake_compaction_max_tasks
- Default: -1
- Type: Int
- Unit: -
- Is mutable: Yes
- Description: The maximum number of concurrent compaction tasks allowed in a shared-data cluster. Setting this item to -1 indicates calculating the concurrent task number in an adaptive manner, that is, the number of surviving CN nodes multiplied by 16. Setting this value to 0 will disable compaction.
- Introduced in: v3.1.0
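For example, to temporarily halt compaction cluster-wide and later restore the adaptive default:
-- Disable compaction.
ADMIN SET FRONTEND CONFIG ("lake_compaction_max_tasks" = "0");
-- Restore adaptive scheduling (16 tasks per surviving CN).
ADMIN SET FRONTEND CONFIG ("lake_compaction_max_tasks" = "-1");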
ADMIN SET FRONTEND CONFIG ("lake_compaction_disable_tables" = "11111;22222");
lake_compaction_disable_tables
- Default:""
- Type:String
- Unit:-
- Is mutable:Yes
- Description:Disable compaction for certain tables. This will not affect compaction that has started. The value of this item is table ID. Multiple values are separated by ';'.
- Introduced in:v3.2.7
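Because this item takes table IDs rather than table names, you first need to look up the ID of the table you want to exclude. One way is the information_schema.tables_config view (assuming your version exposes the TABLE_ID column):
SELECT TABLE_ID FROM information_schema.tables_config
WHERE TABLE_SCHEMA = '<database_name>' AND TABLE_NAME = '<table_name>';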
CN parameters
You can configure the following CN parameters dynamically.
UPDATE information_schema.be_configs SET VALUE = 8
WHERE name = "compact_threads";
compact_threads
- Default: 4
- Type: Int
- Unit: -
- Is mutable: Yes
- Description: The maximum number of threads used for concurrent compaction tasks. This configuration can be modified dynamically from v3.1.7 and v3.2.2 onwards.
- Introduced in: v3.0.0
NOTE
In production, it is recommended to set compact_threads to 25% of the BE/CN CPU core count.
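To confirm that a new value has taken effect on every CN, you can query the same view used to set it; this is the same statement used later in the Troubleshooting section:
SELECT * FROM information_schema.be_configs WHERE name = "compact_threads";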
max_cumulative_compaction_num_singleton_deltas
- Default: 500
- Type: Int
- Unit: -
- Is mutable: Yes
- Description: The maximum number of segments that can be merged in a single Cumulative Compaction. You can reduce this value if OOM occurs during compaction.
- Introduced in: -
NOTE
In production, it is recommended to set max_cumulative_compaction_num_singleton_deltas to 100 to accelerate compaction tasks and reduce their resource consumption.
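Following the same pattern used for compact_threads above, you can apply the recommended value dynamically (assuming this item is mutable in your version, as indicated above):
UPDATE information_schema.be_configs SET VALUE = 100
WHERE name = "max_cumulative_compaction_num_singleton_deltas";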
lake_pk_compaction_max_input_rowsets
- Default: 500
- Type: Int
- Unit: -
- Is mutable: Yes
- Description: The maximum number of input rowsets allowed in a Primary Key table compaction task in a shared-data cluster. The default value of this parameter was changed from 5 to 1000 in v3.2.4 and v3.1.10, and to 500 in v3.3.1 and v3.2.9. After the Sized-tiered Compaction policy is enabled for Primary Key tables (by setting enable_pk_size_tiered_compaction_strategy to true), StarRocks no longer needs to limit the number of rowsets for each compaction to reduce write amplification. Therefore, the default value of this parameter was increased.
- Introduced in: v3.1.8, v3.2.3
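If you need to toggle the Sized-tiered Compaction policy mentioned above, a minimal sketch follows; it assumes enable_pk_size_tiered_compaction_strategy is exposed as a mutable item in information_schema.be_configs in your version:
UPDATE information_schema.be_configs SET VALUE = "true"
WHERE name = "enable_pk_size_tiered_compaction_strategy";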
Manually trigger compaction tasks
You can manually trigger compaction for an entire table or for specific partitions using the ALTER TABLE COMPACT statement.
-- Trigger compaction for the whole table.
ALTER TABLE <table_name> COMPACT;
-- Trigger compaction for a specific partition.
ALTER TABLE <table_name> COMPACT <partition_name>;
-- Trigger compaction for multiple partitions.
ALTER TABLE <table_name> COMPACT (<partition_name>, <partition_name>, ...);
Cancel compaction tasks
You can manually cancel a compaction task using the transaction ID of the task.
CANCEL COMPACTION WHERE TXN_ID = <TXN_ID>;
NOTE
- The CANCEL COMPACTION statement must be submitted from the Leader FE node.
- The CANCEL COMPACTION statement only applies to transactions that have not been committed, that is, CommitTime is NULL in the output of SHOW PROC '/compactions'.
- CANCEL COMPACTION is an asynchronous process. You can check whether the task has been cancelled by executing SHOW PROC '/compactions'.
Best practices
Since Compaction is crucial for query performance, it is recommended to regularly monitor the data merging status of tables and partitions. Here are some best practices and guidelines:
- Try to increase the time interval between loading (avoid scenarios with intervals less than 10 seconds) and increase the batch size per load (avoid batch sizes smaller than 100 rows of data).
- Adjust the number of parallel compaction worker threads on CN to accelerate task execution. It is recommended to set compact_threads to 25% of the BE/CN CPU core count in a production environment.
- Monitor the compaction task status using SHOW PROC '/compactions' and SELECT * FROM information_schema.be_cloud_native_compactions.
- Monitor the Compaction Score and configure alerts based on it. StarRocks' built-in Grafana monitoring template includes this metric.
- Pay attention to the resource consumption during compaction, especially memory usage. The Grafana monitoring template also includes this metric.
Troubleshooting
Slow queries
To identify slow queries caused by untimely compaction, you can check, in the SQL Profile, the value of SegmentsReadCount divided by TabletCount within a single Fragment. If the ratio is large, such as tens or more, untimely compaction may be the cause of the slow query.
High Max Compaction Score in the cluster
- Check whether the compaction-related parameters are within reasonable ranges using ADMIN SHOW FRONTEND CONFIG LIKE "%lake_compaction%" and SELECT * FROM information_schema.be_configs WHERE name = "compact_threads".
- Check whether compaction is stuck using SHOW PROC '/compactions':
  - If CommitTime remains NULL, check the system view information_schema.be_cloud_native_compactions for the reason why compaction is stuck.
  - If FinishTime remains NULL, search the Leader FE log for the Publish failure reason using the TxnID.
- Check whether compaction is running slowly using SHOW PROC '/compactions':
  - If sub_task_count is too large (check the size of each tablet in the partition using SHOW PARTITIONS), the table may have been created improperly.
  - If read_remote_mb is too large (more than 30% of the total data read), check the server disk size, and also check the cache quota through the DataCacheMetrics field in SHOW BACKENDS.
  - If write_remote_sec is too large (more than 90% of the total compaction time), writes to the remote storage may be too slow. You can verify this by checking the shared-data-specific monitoring metrics with the keywords single upload latency and multi upload latency.
  - If in_queue_sec is too large (the average waiting time per tablet exceeds 60 seconds), the parameter settings may be unreasonable, or other running compaction tasks are too slow.
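Putting these checks together, a minimal diagnostic pass could run the following statements (all introduced earlier in this topic) and compare the results against the thresholds above:
-- 1. Verify compaction-related parameters.
ADMIN SHOW FRONTEND CONFIG LIKE "%lake_compaction%";
SELECT * FROM information_schema.be_configs WHERE name = "compact_threads";
-- 2. Check overall task status and profiles.
SHOW PROC '/compactions';
-- 3. Inspect sub-task details for stuck or slow transactions.
SELECT * FROM information_schema.be_cloud_native_compactions;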