Manage Alerts
This topic introduces various alert items from different dimensions, including business continuity, cluster availability, and machine load, and provides corresponding resolutions.
In the following examples, all variables are prefixed with `$` and should be replaced according to your business environment. For example, `$job_name` should be replaced with the corresponding job name in your Prometheus configuration, and `$fe_leader` with the IP address of the Leader FE node.
Service Suspension Alerts
FE Service Suspension
PromSQL
count(up{group="fe", job="$job_name"}) < 3
Alert Description
An alert is triggered when the number of active FE nodes falls below a specified value. You can adjust this value based on the actual number of FE nodes.
Resolution
Try to restart the suspended FE node.
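Before restarting, you can confirm on the affected node whether the FE process is actually gone. A minimal sketch, assuming the FE runs with the usual `StarRocksFE` Java main class:

```shell
# Print the PID(s) of the FE Java process; empty output means the FE is down.
ps -ef | grep StarRocksFE | grep -v grep | awk '{print $2}'
```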
BE Service Suspension
PromSQL
node_info{type="be_node_num", job="$job_name",state="dead"} > 1
Alert Description
An alert is triggered when more than one BE node is suspended.
Resolution
Try to restart the suspended BE node.
Machine Load Alerts
BE CPU Alert
PromSQL
(1-(sum(rate(starrocks_be_cpu{mode="idle", job="$job_name",instance=~".*"}[5m])) by (job, instance)) / (sum(rate(starrocks_be_cpu{job="$job_name",instance=~".*"}[5m])) by (job, instance))) * 100 > 90
Alert Description
An alert is triggered when BE CPU Utilization exceeds 90%.
Resolution
Check whether there are large queries or large-scale data loading, and forward the details to the support team for further investigation.

- Use the `top` command to check resource usage by process:

  top -Hp $be_pid

- Use the `perf` command to collect and analyze performance data:

  # Execute the command for 1-2 minutes, and terminate it by pressing CTRL+C.
  sudo perf top -p $be_pid -g >/tmp/perf.txt
In emergencies, to quickly restore service, you can try to restart the corresponding BE node after preserving the stack. An emergency here refers to a situation where the BE node's CPU utilization remains abnormally high, and no effective means are available to reduce CPU usage.
Memory Alert
PromSQL
(1-node_memory_MemAvailable_bytes{instance=~".*"}/node_memory_MemTotal_bytes{instance=~".*"})*100 > 90
Alert Description
An alert is triggered when memory usage exceeds 90%.
Resolution
Refer to Get Heap Profile for troubleshooting.

- In emergencies, you can try restarting the corresponding BE service to restore it. An emergency here refers to a situation where the BE node's memory usage remains abnormally high and no effective means are available to reduce it.
- If other services deployed on the same machine are affecting the system, you may consider terminating those services in emergencies.
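To cross-check the alert on the host itself, you can compute the same percentage the PromQL expression uses directly from `/proc/meminfo` (Linux only; a minimal sketch):

```shell
# Memory usage percent = (1 - MemAvailable/MemTotal) * 100, mirroring the alert expression.
awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
     END {printf "%.1f\n", (1 - avail/total) * 100}' /proc/meminfo
```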
Disk Alerts
Disk Load Alert
PromSQL
rate(node_disk_io_time_seconds_total{instance=~".*"}[1m]) * 100 > 90
Alert Description
An alert is triggered when disk load exceeds 90%.
Resolution
If the cluster triggers a `node_disk_io_time_seconds_total` alert, first check whether there have been any business changes. If so, consider rolling back the changes to maintain the previous resource balance. If no changes are identified or a rollback is not possible, consider whether normal business growth is driving the need for resource expansion. You can use the `iotop` tool to analyze disk I/O usage. `iotop` has a UI similar to `top` and includes information such as `pid`, `user`, and I/O throughput.
You can also use the following SQL query to identify the tablets consuming significant I/O and trace them back to specific tasks and tables.
-- "all" indicates all services. 10 indicates the collection lasts 10 seconds. 3 indicates fetching the top 3 results.
ADMIN EXECUTE ON $backend_id 'System.print(ExecEnv.io_profile_and_get_topn_stats("all", 10, 3))';
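If `iotop` is not available, you can approximate the alert expression from `/proc/diskstats` directly: field 13 is the cumulative milliseconds the device has spent doing I/O, which is the counter behind `node_disk_io_time_seconds_total`. A minimal sketch (replace `sda` with your data disk device):

```shell
# Sample the I/O-time counter twice and convert the delta to a busy percentage.
dev=sda
t1=$(awk -v d="$dev" '$3==d {print $13}' /proc/diskstats)
sleep 5
t2=$(awk -v d="$dev" '$3==d {print $13}' /proc/diskstats)
echo "$dev busy: $(( (t2 - t1) / 50 ))%"   # ms of I/O over a 5000 ms window
```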
Root Path Capacity Alert
PromSQL
node_filesystem_free_bytes{mountpoint="/"} /1024/1024/1024 < 5
Alert Description
An alert is triggered when the available space in the root directory is less than 5GB.
Resolution
Common directories that may occupy significant space include /var, /opt, and /tmp. Use the following command to check for large files and clear unnecessary ones.
du -sh / --max-depth=1
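To surface the biggest offenders first, you can sort the per-directory totals; a sketch:

```shell
# Largest top-level directories first; -x stays on the root filesystem.
du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -10
```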
Data Disk Capacity Alert
PromSQL
(SUM(starrocks_be_disks_total_capacity{job="$job"}) by (host, path) - SUM(starrocks_be_disks_avail_capacity{job="$job"}) by (host, path)) / SUM(starrocks_be_disks_total_capacity{job="$job"}) by (host, path) * 100 > 90
Alert Description
An alert is triggered when disk capacity utilization exceeds 90%.
Resolution
- Check whether the loaded data volume has changed.

  Monitor the `load_bytes` metric in Grafana. If there has been a significant increase in data loading volume, you may need to scale the system resources.

- Check for DROP operations.

  If the data loading volume has not changed much, run SHOW BACKENDS. If the reported disk usage does not match the actual usage, check the FE Audit Log for recent DROP DATABASE, DROP TABLE, or DROP PARTITION operations.

  Metadata for these operations remains in FE memory for one day, which allows you to restore the data with the RECOVER statement within 24 hours in case of misoperation. After recovery, the actual disk usage may exceed what SHOW BACKENDS reports.

  The retention period of deleted data in FE memory can be adjusted using the FE dynamic parameter `catalog_trash_expire_second` (default value: 86400, in seconds):

  ADMIN SET FRONTEND CONFIG ("catalog_trash_expire_second"="86400");

  To persist this change, add the configuration item to the FE configuration file fe.conf.

  After that, deleted data is moved to the trash directory on the BE nodes (`$storage_root_path/trash`). By default, deleted data is kept in the trash directory for one day, which may also result in the actual disk usage exceeding what SHOW BACKENDS reports.

  The retention time of deleted data in the trash directory can be adjusted using the BE dynamic parameter `trash_file_expire_time_sec` (default value: 86400, in seconds):

  curl http://$be_ip:$be_http_port/api/update_config?trash_file_expire_time_sec=86400
FE Metadata Disk Capacity Alert
PromSQL
node_filesystem_free_bytes{mountpoint="${meta_path}"} /1024/1024/1024 < 10
Alert Description
An alert is triggered when the available disk space for FE metadata is less than 10GB.
Resolution
Use the following command to check for directories occupying large amounts of space and clear unnecessary files. The metadata path is specified by the `meta_dir` configuration item in fe.conf.

du -sh ${meta_dir} --max-depth=1
If the metadata directory occupies a lot of space, it is usually because the bdb directory is large, possibly due to CheckPoint failure. Refer to the CheckPoint Failure Alert for troubleshooting. If this method does not solve the issue, contact the technical support team.
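To check whether the bdb directory is indeed the culprit, you can compare the usual subdirectories of the metadata path; a sketch assuming the default `bdb` and `image` layout under `${meta_dir}`:

```shell
# Compare the BDB journal size against the image size; a bdb directory much
# larger than the image usually points to a CheckPoint problem.
du -sh ${meta_dir}/bdb ${meta_dir}/image
```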
Cluster Service Exception Alerts
Compaction Failure Alerts
Cumulative and Base Compaction Failure Alert
PromSQL
increase(starrocks_be_engine_requests_total{job="$job_name" ,status="failed",type="cumulative_compaction"}[1m]) > 3
increase(starrocks_be_engine_requests_total{job="$job_name" ,status="failed",type="base_compaction"}[1m]) > 3
Alert Description
An alert is triggered when there are more than three Cumulative Compaction or Base Compaction failures within the last minute.
Resolution
Search the log of the corresponding BE node for the following keywords to identify the involved tablet.
grep -E 'compaction' be.INFO | grep failed
A log record like the following indicates a Compaction failure.
W0924 17:52:56.537041 123639 compaction_task.cpp:193] compaction task:8482. tablet:8423674 failed.
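When many such failure lines appear, you can pull out the distinct tablet IDs in one pass; a sketch assuming the `tablet:<id>` format shown in the line above:

```shell
# List each tablet that failed compaction, de-duplicated.
grep -E 'compaction' be.INFO | grep failed | grep -oE 'tablet:[0-9]+' | cut -d: -f2 | sort -u
```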
You can check the context of the log to analyze the failure. Typically, the failure may have been caused by a DROP TABLE or PARTITION operation during the Compaction process. The system has an internal retry mechanism for Compaction, and you can also manually set the tablet's status to BAD and trigger a Clone task to repair it.
Before performing the following operation, ensure that the table has at least three complete replicas.
ADMIN SET REPLICA STATUS PROPERTIES("tablet_id" = "$tablet_id", "backend_id" = "$backend_id", "status" = "bad");
High Compaction Pressure Alert
PromSQL
starrocks_fe_max_tablet_compaction_score{job="$job_name",instance="$fe_leader"} > 100
Alert Description
An alert is triggered when the highest Compaction Score exceeds 100, indicating high Compaction pressure.
Resolution
This alert is typically caused by frequent loading, `INSERT INTO VALUES`, or `DELETE` operations (for example, at a rate of one per second). It is recommended to set the interval between loading or DELETE tasks to more than 5 seconds and to avoid submitting high-concurrency DELETE tasks.
Exceeding Version Count Alert
PromSQL
starrocks_be_max_tablet_rowset_num{job="$job_name"} > 700
Alert Description
An alert is triggered when a tablet on a BE node has more than 700 data versions.
Resolution
Use the following command to check the tablet with excessive versions:
SELECT BE_ID,TABLET_ID FROM information_schema.be_tablets WHERE NUM_ROWSET>700;
Example for the tablet with ID `2889156`:

SHOW TABLET 2889156;

Execute the command returned in the `DetailCmd` field:

SHOW PROC '/dbs/2601148/2889154/partitions/2889153/2889155/2889156';
Under normal circumstances, all three replicas should be in `NORMAL` status, and other metrics, such as `RowCount` and `DataSize`, should remain consistent across replicas. If only one replica exceeds the version limit of 700, you can trigger a Clone task based on the other replicas using the following command:
ADMIN SET REPLICA STATUS PROPERTIES("tablet_id" = "$tablet_id", "backend_id" = "$backend_id", "status" = "bad");
If two or more replicas exceed the version limit, you can temporarily increase the version count limit:
# Replace be_ip with the IP of the BE node which stores the tablet that exceeds the version limit.
# The default be_http_port is 8040.
# The default value of tablet_max_versions is 1000.
curl -XPOST http://$be_ip:$be_http_port/api/update_config?tablet_max_versions=2000
CheckPoint Failure Alert
PromSQL
starrocks_fe_meta_log_count{job="$job_name",instance="$fe_leader"} > 100000
Alert Description
An alert is triggered when the FE node's BDB log count exceeds 100,000. By default, the system performs a CheckPoint when the BDB log count exceeds 50,000, and then resets the count to 0.
Resolution
This alert indicates that a CheckPoint was not performed. You need to investigate the FE logs to analyze the CheckPoint process and resolve the issue.

In the fe.log of the Leader FE node, search for records like `begin to generate new image: image.xxxx`. If found, the system has started generating a new image. Then check the logs for records like `checkpoint finished save image.xxxx` to confirm that the image was created successfully. If you find `Exception when generate new image file`, the image generation failed. You should handle the metadata carefully based on the specific error. It is recommended to contact the support team for further analysis.
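The three log patterns above can be checked in one pass; a sketch run against the Leader FE's fe.log:

```shell
# Show the most recent image-generation lifecycle events.
grep -E 'begin to generate new image|checkpoint finished save image|Exception when generate new image' fe.log | tail -20
```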
Excessive FE Thread Count Alert
PromSQL
starrocks_fe_thread_pool{job="$job_name"} > 3000
Alert Description
An alert is triggered when the number of threads on the FE exceeds 3000.
Resolution
The default thread count limit for FE and BE nodes is 4096. A large number of UNION ALL queries typically leads to an excessive thread count. It is recommended to reduce the concurrency of UNION ALL queries and adjust the system variable `pipeline_dop`. If it is not possible to tune individual SQL queries, you can adjust `pipeline_dop` globally:
SET GLOBAL pipeline_dop=8;
In emergencies, to restore services quickly, you can increase the FE dynamic parameter `thrift_server_max_worker_threads` (default value: 4096).
ADMIN SET FRONTEND CONFIG ("thrift_server_max_worker_threads"="8192");
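To see the live thread count of the FE process on the host (the number this alert compares against the limit), a sketch using `/proc` (Linux only; `$fe_pid` is the FE process ID):

```shell
# Count the threads of the FE process via its /proc task list.
ls /proc/$fe_pid/task | wc -l
```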
High FE JVM Usage Alert
PromSQL
sum(jvm_heap_size_bytes{job="$job_name", type="used"}) * 100 / sum(jvm_heap_size_bytes{job="$job_name", type="max"}) > 90
Alert Description
An alert is triggered when the JVM usage on an FE node exceeds 90%.
Resolution
This alert indicates that JVM usage is too high. You can use the `jmap` command to analyze the situation. Since detailed monitoring information for this metric is still under development, direct insights are limited. Perform the following actions and send the results to the support team for analysis:
# Note that specifying `live` in the command may cause FE to restart.
jmap -histo[:live] $fe_pid > jmap.dump
In emergencies, to quickly restore services, you can restart the corresponding FE node or increase the JVM (Xmx) size and then restart the FE service.