Generated columns
Since v3.1, StarRocks supports generated columns. A generated column precomputes and stores the result of an expression and supports query rewrites, which can significantly accelerate queries that contain the same complex expression.
You can define one or more generated columns to store the results of expressions at table creation. When you execute a query that contains an expression whose result is stored in a generated column, the CBO rewrites the query to read the data directly from the generated column. Alternatively, you can query the data in the generated column directly.
Note that loading data into a table with generated columns takes additional time and overhead, because StarRocks needs to compute the expression results and write them into the generated columns. It is therefore recommended to evaluate the impact of generated columns on loading performance.
It is also recommended to define generated columns at table creation rather than add or modify them afterwards, because adding or modifying generated columns after table creation is time-consuming and costly.
Currently, StarRocks's shared-data mode does not support generated columns.
Basic operations
Create generated columns
Syntax
<col_name> <data_type> [NULL] AS <expr> [COMMENT 'string']
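As a hedged illustration of the optional parts of this syntax, a column definition might look like the following; the column name, expression, and comment are placeholders:
-- Hypothetical column definition using the optional NULL and COMMENT parts.
newcol1 DOUBLE NULL AS array_avg(data_array) COMMENT 'average of data_array'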
Create generated columns at table creation
Create a table named test_tbl1 with five columns, of which newcol1 and newcol2 are generated columns. Their values are computed by using the specified expressions, referencing the values of the regular columns data_array and data_json respectively.
CREATE TABLE test_tbl1
(
id INT NOT NULL,
data_array ARRAY<int> NOT NULL,
data_json JSON NOT NULL,
newcol1 DOUBLE AS array_avg(data_array),
newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
NOTICE:
- Generated columns must be defined after regular columns.
- Aggregate functions cannot be used in the expressions for generated columns.
- The expressions for generated columns cannot reference other generated columns or auto-increment columns, but they can reference multiple regular columns (see the sketch after this list).
- The data type of a generated column must match the data type of the result generated by the expression for the generated column.
- Generated columns cannot be created on Aggregate tables.
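As a minimal sketch of the point about referencing multiple regular columns, a generated column may reference more than one regular column in its expression. The column newcol3 below is hypothetical and assumes the result type of the expression is DOUBLE:
-- Hypothetical definition inside a CREATE TABLE statement:
-- the expression references two regular columns, id and data_array.
newcol3 DOUBLE AS array_avg(data_array) + id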
Add generated columns after table creation
In most cases, the expressions frequently used in queries are determined only after table creation, so generated columns are often added after table creation. StarRocks optimizes the underlying logic for adding generated columns after table creation: it does not need to rewrite all the existing data. Instead, it only writes the data of the newly added generated column and associates that data with the existing physical data files, which greatly improves the efficiency of adding generated columns after table creation.
- Create a table named test_tbl2 with three regular columns: id, data_array, and data_json. Insert a data row into the table.
-- Create a table.
CREATE TABLE test_tbl2
(
id INT NOT NULL,
data_array ARRAY<int> NOT NULL,
data_json JSON NOT NULL
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
-- Insert a data row.
INSERT INTO test_tbl2 VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
-- Query the table.
MySQL [example_db]> select * from test_tbl2;
+------+------------+------------------+
| id | data_array | data_json |
+------+------------+------------------+
| 1 | [1,2] | {"a": 1, "b": 2} |
+------+------------+------------------+
1 row in set (0.04 sec)
- Execute ALTER TABLE ... ADD COLUMN ... to add the generated columns newcol1 and newcol2, whose values are computed by the specified expressions based on the regular columns data_array and data_json.
ALTER TABLE test_tbl2
ADD COLUMN newcol1 DOUBLE AS array_avg(data_array);
ALTER TABLE test_tbl2
ADD COLUMN newcol2 String AS json_string(json_query(data_json, "$.a"));
NOTICE:
- Adding generated columns to Aggregate tables is not supported.
- Regular columns need to be defined before generated columns. When you use the ALTER TABLE ... ADD COLUMN ... statement to add a regular column without specifying its position, the system automatically places it before the generated columns. Moreover, you cannot use AFTER to explicitly place the regular column after a generated column (see the sketch after this example).
- Query the table data.
MySQL [example_db]> SELECT * FROM test_tbl2;
+------+------------+------------------+---------+---------+
| id | data_array | data_json | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
| 1 | [1,2] | {"a": 1, "b": 2} | 1.5 | 1 |
+------+------------+------------------+---------+---------+
1 row in set (0.04 sec)
The result shows that the generated columns newcol1 and newcol2 are added to the table, and StarRocks automatically computes their values based on the expressions.
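To illustrate the placement rule from the NOTICE above, here is a minimal sketch. It assumes a hypothetical regular column extra_col added to test_tbl2 without specifying a position; the exact syntax and default value are illustrative:
-- extra_col is a hypothetical regular column used only to illustrate the placement rule.
ALTER TABLE test_tbl2 ADD COLUMN extra_col INT DEFAULT "0";
-- Expected column order afterwards: id, data_array, data_json, extra_col, newcol1, newcol2.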
Load data into generated columns
During data loading, StarRocks automatically calculates the values for generated columns based on the expressions. You cannot specify the values of generated columns. The following example uses the INSERT INTO statement to load data:
- Use INSERT INTO to insert a record into the test_tbl1 table. Note that you cannot specify values for the generated columns in the VALUES () clause.
INSERT INTO test_tbl1 (id, data_array, data_json)
VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
- Query the table data.
MySQL [example_db]> SELECT * FROM test_tbl1;
+------+------------+------------------+---------+---------+
| id | data_array | data_json | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
| 1 | [1,2] | {"a": 1, "b": 2} | 1.5 | 1 |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
The results show that StarRocks automatically computes the values for the generated columns newcol1 and newcol2 based on the expressions.
NOTICE:
The following errors are returned if you specify values for the generated columns during data loading:
MySQL [example_db]> INSERT INTO test_tbl1 (id, data_array, data_json, newcol1, newcol2)
VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
ERROR 1064 (HY000): Getting analyzing error. Detail message: materialized column 'newcol1' can not be specified.
MySQL [example_db]> INSERT INTO test_tbl1 VALUES (2, [3,4], parse_json('{"a" : 3, "b" : 4}'), 3.5, "3");
ERROR 1064 (HY000): Getting analyzing error. Detail message: Column count doesn't match value count.
Modify generated columns
When modifying a generated column, StarRocks needs to rewrite all the data, which is time-consuming and resource-intensive. If it is unavoidable to use ALTER TABLE to modify generated columns, it is recommended to evaluate the cost and time involved in advance.
You can modify the data type and expression of a generated column.
- Create a table named test_tbl3 with five columns, of which newcol1 and newcol2 are generated columns. Their values are computed by using the specified expressions, referencing the values of the regular columns data_array and data_json respectively. Insert a data row into the table.
-- Create a table.
MySQL [example_db]> CREATE TABLE test_tbl3
(
id INT NOT NULL,
data_array ARRAY<int> NOT NULL,
data_json JSON NOT NULL,
-- The data types and expressions of generated columns are specified as follows:
newcol1 DOUBLE AS array_avg(data_array),
newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
-- Insert a data row.
INSERT INTO test_tbl3 (id, data_array, data_json)
VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
-- Query the table.
MySQL [example_db]> select * from test_tbl3;
+------+------------+------------------+---------+---------+
| id | data_array | data_json | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
| 1 | [1,2] | {"a": 1, "b": 2} | 1.5 | 1 |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
- Modify the generated columns newcol1 and newcol2:
  - Change the data type of the generated column newcol1 to ARRAY<INT> and change its expression to data_array.
ALTER TABLE test_tbl3
MODIFY COLUMN newcol1 ARRAY<INT> AS data_array;
  - Modify the expression of the generated column newcol2 to extract the values of field b from the regular column data_json.
ALTER TABLE test_tbl3
MODIFY COLUMN newcol2 String AS json_string(json_query(data_json, "$.b"));
- View the modified schema and the data in the table.
  - View the modified schema.
MySQL [example_db]> show create table test_tbl3\G
**** 1. row ****
Table: test_tbl3
Create Table: CREATE TABLE test_tbl3 (
id int(11) NOT NULL COMMENT "",
data_array array<int(11)> NOT NULL COMMENT "",
data_json json NOT NULL COMMENT "",
-- After modification, the data types and expressions of generated columns are as follows:
newcol1 array<int(11)> NULL AS example_db.test_tbl3.data_array COMMENT "",
newcol2 varchar(65533) NULL AS json_string(json_query(example_db.test_tbl3.data_json, '$.b')) COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(id)
DISTRIBUTED BY HASH(id)
PROPERTIES (...);
1 row in set (0.00 sec)
  - Query the table data after modification. The result shows that StarRocks recalculates the values of the generated columns newcol1 and newcol2 based on the modified expressions.
MySQL [example_db]> select * from test_tbl3;
+------+------------+------------------+---------+---------+
| id | data_array | data_json | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
| 1 | [1,2] | {"a": 1, "b": 2} | [1,2] | 2 |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
Drop a generated column
Drop the column newcol1 from the table test_tbl3.
ALTER TABLE test_tbl3 DROP COLUMN newcol1;
NOTICE:
If a generated column references a regular column in its expression, you cannot directly drop or modify that regular column. Instead, you must first drop the generated column and then drop or modify the regular column.
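A minimal sketch of that order, assuming the table state from the examples above (newcol2 still references data_json):
-- Dropping data_json directly would fail while newcol2 still references it.
ALTER TABLE test_tbl3 DROP COLUMN newcol2;   -- drop the generated column first
ALTER TABLE test_tbl3 DROP COLUMN data_json; -- then drop or modify the regular column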
Query rewrites
If the expression in a query matches the expression of a generated column, the optimizer automatically rewrites the query to directly read the values of the generated column.
- Suppose that you create a table test_tbl4 with the following schema:
CREATE TABLE test_tbl4
(
id INT NOT NULL,
data_array ARRAY<int> NOT NULL,
data_json JSON NOT NULL,
newcol1 DOUBLE AS array_avg(data_array),
newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id) DISTRIBUTED BY HASH(id);
- If you query the data in the table test_tbl4 by using the statement SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4;, the query involves only the regular columns data_array and data_json. However, the expressions in the query match the expressions of the generated columns newcol1 and newcol2. In this case, the execution plan shows that the CBO automatically rewrites the query to read the values of the generated columns newcol1 and newcol2.
MySQL [example_db]> EXPLAIN SELECT array_avg(data_array), json_string(json_query(data_json, "$.a")) FROM test_tbl4;
+---------------------------------------+
| Explain String |
+---------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS:4: newcol1 | 5: newcol2 | -- The query is rewritten to read data from the generated columns newcol1 and newcol2.
| PARTITION: RANDOM |
| |
| RESULT SINK |
| |
| 0:OlapScanNode |
| TABLE: test_tbl4 |
| PREAGGREGATION: ON |
| partitions=0/1 |
| rollup: test_tbl4 |
| tabletRatio=0/0 |
| tabletList= |
| cardinality=1 |
| avgRowSize=2.0 |
+---------------------------------------+
15 rows in set (0.00 sec)
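As noted at the beginning of this topic, you can also read the same precomputed values by querying the generated columns directly:
SELECT newcol1, newcol2 FROM test_tbl4;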
Partial updates and generated columns
To perform partial updates on a Primary Key table, you must specify all the regular columns referenced by the generated columns in the columns parameter. The following example uses Stream Load to perform partial updates.
- Create a table named test_tbl5 with five columns, of which newcol1 and newcol2 are generated columns. Their values are computed by using the specified expressions, referencing the values of the regular columns data_array and data_json respectively. Insert a data row into the table.
-- Create a table.
CREATE TABLE test_tbl5
(
id INT NOT NULL,
data_array ARRAY<int> NOT NULL,
data_json JSON NULL,
newcol1 DOUBLE AS array_avg(data_array),
newcol2 String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);
-- Insert a data row.
INSERT INTO test_tbl5 (id, data_array, data_json)
VALUES (1, [1,2], parse_json('{"a" : 1, "b" : 2}'));
-- Query the table.
MySQL [example_db]> select * from test_tbl5;
+------+------------+------------------+---------+---------+
| id | data_array | data_json | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
| 1 | [1,2] | {"a": 1, "b": 2} | 1.5 | 1 |
+------+------------+------------------+---------+---------+
1 row in set (0.01 sec)
- Prepare a CSV file my_data1.csv to update some columns in the test_tbl5 table.
1|[3,4]|{"a": 3, "b": 4}
2|[3,4]|{"a": 3, "b": 4}
- Use Stream Load with the my_data1.csv file to update some columns of the test_tbl5 table. You need to set partial_update:true and specify all the regular columns referenced by the generated columns in the columns parameter.
curl --location-trusted -u <username>:<password> -H "label:1" \
    -H "column_separator:|" \
    -H "partial_update:true" \
    -H "columns:id,data_array,data_json" \
    -T my_data1.csv -XPUT \
    http://<fe_host>:<fe_http_port>/api/example_db/test_tbl5/_stream_load
- Query the table data.
MySQL [example_db]> select * from test_tbl5;
+------+------------+------------------+---------+---------+
| id | data_array | data_json | newcol1 | newcol2 |
+------+------------+------------------+---------+---------+
| 1 | [3,4] | {"a": 3, "b": 4} | 3.5 | 3 |
| 2 | [3,4] | {"a": 3, "b": 4} | 3.5 | 3 |
+------+------------+------------------+---------+---------+
2 rows in set (0.01 sec)
An error is returned by Stream Load if you perform partial updates without specifying all the regular columns referenced by the generated columns.
- Prepare a CSV file my_data2.csv.
1|[3,4]
2|[3,4]
- When partial column updates are performed by using Stream Load with the my_data2.csv file, and the values for the data_json column are provided neither in my_data2.csv nor in the columns parameter of the Stream Load job, Stream Load returns an error even though the data_json column allows NULL values. This is because the data_json column is referenced by the generated column newcol2.
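For reference, a Stream Load command of the following shape would hit this error, because data_json is omitted from both the CSV file and the columns parameter. The label value is illustrative; the other placeholders follow the earlier example.
curl --location-trusted -u <username>:<password> -H "label:2" \
    -H "column_separator:|" \
    -H "partial_update:true" \
    -H "columns:id,data_array" \
    -T my_data2.csv -XPUT \
    http://<fe_host>:<fe_http_port>/api/example_db/test_tbl5/_stream_load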