Gather statistics for CBO
This topic describes the basic concept of StarRocks cost-based optimizer (CBO) and how to collect statistics for the CBO to select an optimal query plan. StarRocks 2.4 introduces histograms to gather accurate data distribution statistics.
Since v3.2.0, StarRocks supports collecting statistics from Hive, Iceberg, and Hudi tables, reducing the dependency on other metastore systems. The syntax is similar to collecting StarRocks internal tables.
What is CBO
CBO is critical to query optimization. After an SQL query arrives at StarRocks, it is parsed into a logical execution plan. The CBO rewrites and transforms the logical plan into multiple physical execution plans. The CBO then estimates the execution cost of each operator in the plan (such as CPU, memory, network, and I/O) and chooses a query path with the lowest cost as the final physical plan.
StarRocks CBO was launched in StarRocks 1.16.0 and is enabled by default from 1.19 onwards. Developed upon the Cascades framework, StarRocks CBO estimates cost based on various statistical information. It is capable of choosing an execution plan with the lowest cost among tens of thousands of execution plans, significantly improving the efficiency and performance of complex queries.
Statistics are important for the CBO. They determine whether cost estimation is accurate and useful. The following sections detail the types of statistical information, the collection policy, and how to collect statistics and view statistical information.
Types of statistical information
StarRocks collects a variety of statistics as the input for cost estimation.
Basic statistics
By default, StarRocks periodically collects the following basic statistics of tables and columns:
-
row_count: total number of rows in a table
-
data_size: data size of a column
-
ndv: column cardinality, which is the number of distinct values in a column
-
null_count: the amount of data with NULL values in a column
-
min: minimum value in a column
-
max: maximum value in a column
Full statistics are stored in the column_statistics
of the _statistics_
database. You can view this table to query table statistics. Following is an example of querying statistics data from this table.
SELECT * FROM _statistics_.column_statistics\G
*************************** 1. row ***************************
table_id: 10174
partition_id: 10170
column_name: is_minor
db_id: 10169
table_name: zj_test.source_wiki_edit
partition_name: p06
row_count: 2
data_size: 2
ndv: NULL
null_count: 0
max: 1
min: 0