Use S3 for shared-data
This topic describes how to deploy and use a shared-data StarRocks cluster. This feature is supported from v3.0 for S3 compatible storage and v3.1 for Azure Blob Storage.
NOTE
- StarRocks version 3.1 brings some changes to the shared-data deployment and configuration. Please use this document if you are running version 3.1 or higher.
- If you are running version 3.0 please use the 3.0 documentation.
- Shared-data StarRocks clusters do not support data BACKUP and RESTORE.
The shared-data StarRocks cluster is specifically engineered for the cloud on the premise of separation of storage and compute. It allows data to be stored in object storage (for example, AWS S3, Google GCS, Azure Blob Storage, and MinIO). You can achieve not only cheaper storage and better resource isolation, but elastic scalability for your cluster. The query performance of the shared-data StarRocks cluster aligns with that of a shared-nothing StarRocks cluster when the local disk cache is hit.
In version 3.1 and higher the StarRocks shared-data cluster is made up of Frontend Engines (FEs) and Compute Nodes (CNs). The CNs replace the classic Backend Engines (BEs) in shared-data clusters.
Compared to the classic shared-nothing StarRocks architecture, separation of storage and compute offers a wide range of benefits. By decoupling these components, StarRocks provides:
- Inexpensive and seamlessly scalable storage.
- Elastic scalable compute. Because data is not stored in Compute Nodes (CNs), scaling can be done without data migration or shuffling across nodes.
- Local disk cache for hot data to boost query performance.
- Asynchronous data ingestion into object storage, allowing a significant improvement in loading performance.
Architecture
Deploy a shared-data StarRocks cluster
The deployment of a shared-data StarRocks cluster is similar to that of a shared-nothing StarRocks cluster. The only difference is that you need to deploy CNs instead of BEs in a shared-data cluster. This section only lists the extra FE and CN configuration items you need to add in the configuration files of FE and CN fe.conf and cn.conf when you deploy a shared-data StarRocks cluster. For detailed instructions on deploying a StarRocks cluster, see Deploy StarRocks.
NOTE
Do not start the cluster until after it is configured for shared-storage in the next section of this document.
Configure FE nodes for shared-data StarRocks
Before starting the cluster configure the FEs and CNs. Example configurations are provided below, and then the details for each parameter are provided.
Example FE configurations for S3
These are example shared-data additions for your fe.conf
file on each of your FE nodes. The examples differ based on the AWS authentication method being used.
Default authentication credentials
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
# For example, testbucket/subpath
aws_s3_path = <s3_path>
# For example, us-west-2
aws_s3_region = <region>
# For example, https://s3.us-west-2.amazonaws.com
aws_s3_endpoint = <endpoint_url>
aws_s3_use_aws_sdk_default_behavior = true
# Set this to false if you do not want default
# storage created in the object storage using
# the details provided above
enable_load_volume_from_conf = true
IAM user-based credentials
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
# For example, testbucket/subpath
aws_s3_path = <s3_path>
# For example, us-west-2
aws_s3_region = <region>
# credentials for S3 object read/write
aws_s3_access_key = <access_key>
aws_s3_secret_key = <secret_key>
# Set this to false if you do not want default
# storage created in the object storage using
# the details provided above
enable_load_volume_from_conf = true
Instance profile
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
# For example, testbucket/subpath
aws_s3_path = <s3_path>
# For example, us-west-2
aws_s3_region = <region>
# For example, https://s3.us-west-2.amazonaws.com
aws_s3_endpoint = <endpoint_url>
aws_s3_use_instance_profile = true
# Set this to false if you do not want default
# storage created in the object storage using
# the details provided above
enable_load_volume_from_conf = true
Please make sure you have granted access to both FE and CN nodes in the cluster. FE nodes cannot delegate access to CN nodes.
Assumed role
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
# For example, testbucket/subpath
aws_s3_path = <s3_path>
# For example, us-west-2
aws_s3_region = <region>
# For example, https://s3.us-west-2.amazonaws.com
aws_s3_endpoint = <endpoint_url>
aws_s3_use_instance_profile = true
aws_s3_iam_role_arn = <role_arn>
# Set this to false if you do not want default
# storage created in the object storage using
# the details provided above
enable_load_volume_from_conf = true
Please make sure you have granted access to both FE and CN nodes in the cluster. FE nodes cannot delegate access to CN nodes.
Assumed role from an external account
run_mode = shared_data
cloud_native_meta_port = <meta_port>
cloud_native_storage_type = S3
# For example, testbucket/subpath
aws_s3_path = <s3_path>
# For example, us-west-2
aws_s3_region = <region>
# For example, https://s3.us-west-2.amazonaws.com
aws_s3_endpoint = <endpoint_url>
aws_s3_use_instance_profile = true
aws_s3_iam_role_arn = <role_arn>
aws_s3_external_id = <external_id>
# Set this to false if you do not want default
# storage created in the object storage using
# the details provided above
enable_load_volume_from_conf = true
Please make sure you have granted access to both FE and CN nodes in the cluster. FE nodes cannot delegate access to CN nodes.
All FE parameters related to shared-storage with S3
run_mode
The running mode of the StarRocks cluster. Valid values:
shared_data
shared_nothing
(Default)
NOTE
- You cannot adopt the
shared_data
andshared_nothing
modes simultaneously for a StarRocks cluster. Mixed deployment is not supported.- Do not change
run_mode
after the cluster is deployed. Otherwise, the cluster fails to restart. The transformation from a shared-nothing cluster to a shared-data cluster or vice versa is not supported.
cloud_native_meta_port
The cloud-native meta service RPC port.
- Default:
6090
enable_load_volume_from_conf
Whether to allow StarRocks to create the default storage volume by using the object storage-related properties specified in the FE configuration file. Valid values:
true
(Default) If you specify this item astrue
when creating a new shared-data cluster, StarRocks creates the built-in storage volumebuiltin_storage_volume
using the object storage-related properties in the FE configuration file, and sets it as the default storage volume. However, if you have not specified the object storage-related properties, StarRocks fails to start.false
If you specify this item asfalse
when creating a new shared-data cluster, StarRocks starts directly without creating the built-in storage volume. You must manually create a storage volume and set it as the default storage volume before creating any object in StarRocks. For more information, see Create the default storage volume.
Supported from v3.1.0.
CAUTION
We strongly recommend you leave this item as
true
while you are upgrading an existing shared-data cluster from v3.0. If you specify this item asfalse
, the databases and tables you created before the upgrade become read-only, and you cannot load data into them.
cloud_native_storage_type
The type of object storage you use. In shared-data mode, StarRocks supports storing data in Azure Blob (supported from v3.1.1 onwards), and object storages that are compatible with the S3 protocol (such as AWS S3, Google GCP, and MinIO). Valid value:
S3
(Default)AZBLOB
HDFS
NOTE
- If you specify this parameter as
S3
, you must add the parameters prefixed byaws_s3
.- If you specify this parameter as
AZBLOB
, you must add the parameters prefixed byazure_blob
.- If you specify this parameter as
HDFS
, you must add the parametercloud_native_hdfs_url
.
aws_s3_path
The S3 path used to store data. It consists of the name of your S3 bucket and the sub-path (if any) under it, for example, testbucket/subpath
.
aws_s3_endpoint
The endpoint used to access your S3 bucket, for example, https://s3.us-west-2.amazonaws.com
.
aws_s3_region
The region in which your S3 bucket resides, for example, us-west-2
.
aws_s3_use_aws_sdk_default_behavior
Whether to use the AWS SDK default credentials provider chain. Valid values:
true
false
(Default)
aws_s3_use_instance_profile
Whether to use Instance Profile and Assumed Role as credential methods for accessing S3. Valid values:
true
false
(Default)
If you use IAM user-based credential (Access Key and Secret Key) to access S3, you must specify this item as false
, and specify aws_s3_access_key
and aws_s3_secret_key
.
If you use Instance Profile to access S3, you must specify this item as true
.
If you use Assumed Role to access S3, you must specify this item as true
, and specify aws_s3_iam_role_arn
.
And if you use an external AWS account, you must also specify aws_s3_external_id
.
aws_s3_access_key
The Access Key ID used to access your S3 bucket.
aws_s3_secret_key
The Secret Access Key used to access your S3 bucket.
aws_s3_iam_role_arn
The ARN of the IAM role that has privileges on your S3 bucket in which your data files are stored.
aws_s3_external_id
The external ID of the AWS account that is used for cross-account access to your S3 bucket.
NOTE
Only credential-related configuration items can be modified after your shared-data StarRocks cluster is created. If you changed the original storage path-related configuration items, the databases and tables you created before the change become read-only, and you cannot load data into them.
If you want to create the default storage volume manually after the cluster is created, you only need to add the following configuration items:
run_mode = shared_data
cloud_native_meta_port = <meta_port>
enable_load_volume_from_conf = false
Configure CN nodes for shared-data StarRocks
Before starting CNs, add the following configuration items in the CN configuration file cn.conf:
starlet_port = <starlet_port>
storage_root_path = <storage_root_path>
starlet_port
The CN heartbeat service port for the StarRocks shared-data cluster. Default value: 9070
.
storage_root_path
The storage volume directory that the local cached data depends on. Multiple volumes are separated by semicolon (;). Example: /data1;/data2
.
The default value for storage_root_path
is ${STARROCKS_HOME}/storage
.
Local cache is effective when queries are frequent and the data being queried is recent, but there are cases that you may wish to turn off the local cache completely.
- In a Kubernetes environment with CN pods that scale up and down in number on demand, the pods may not have storage volumes attached.
- When the data being queried is in a data lake in remote storage and most of it is archive (old) data. If the queries are infrequent the data cache will have a low hit ratio and the benefit may not be worth having the cache.
To turn off the data cache set:
storage_root_path =
NOTE
The data is cached under the directory
<storage_root_path>/starlet_cache
.
Use your shared-data StarRocks cluster
The usage of shared-data StarRocks clusters is also similar to that of a classic shared-nothing StarRocks cluster, except that the shared-data cluster uses storage volumes and cloud-native tables to store data in object storage.
Create default storage volume
You can use the built-in storage volumes that StarRocks automatically creates, or you can manually create and set the default storage volume. This section describes how to manually create and set the default storage volume.
NOTE
If your shared-data StarRocks cluster is upgraded from v3.0, you do not need to define a default storage volume because StarRocks created one with the object storage-related properties you specified in the FE configuration file fe.conf. You can still create new storage volumes with other object storage resources and set the default storage volume differently.
To give your shared-data StarRocks cluster permission to store data in your object storage, you must reference a storage volume when you create databases or cloud-native tables. A storage volume consists of the properties and credential information of the remote data storage. If you have deployed a new shared-data StarRocks cluster and disallow StarRocks to create a built-in storage volume (by specifying enable_load_volume_from_conf
as false
), you must define a default storage volume before you can create databases and tables in the cluster.
The following example creates a storage volume def_volume
for an AWS S3 bucket defaultbucket
with the IAM user-based credential (Access Key and Secret Key), enables the Partitioned Prefix feature, and sets it as the default storage volume:
CREATE STORAGE VOLUME def_volume
TYPE = S3
LOCATIONS = ("s3://defaultbucket")
PROPERTIES
(
"enabled" = "true",
"aws.s3.region" = "us-west-2",
"aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
"aws.s3.use_aws_sdk_default_behavior" = "false",
"aws.s3.use_instance_profile" = "false",
"aws.s3.access_key" = "xxxxxxxxxx",
"aws.s3.secret_key" = "yyyyyyyyyy",
"aws.s3.enable_partitioned_prefix" = "true"
);
SET def_volume AS DEFAULT STORAGE VOLUME;
For more information on how to create a storage volume for other object storages and set the default storage volume, see CREATE STORAGE VOLUME and SET DEFAULT STORAGE VOLUME.
Create a database and a cloud-native table
After you create a default storage volume, you can then create a database and a cloud-native table using this storage volume.
Shared-data StarRocks clusters support all StarRocks table types.
The following example creates a database cloud_db
and a table detail_demo
based on Duplicate Key table type, enables the local disk cache, sets the hot data validity duration to one month, and disables asynchronous data ingestion into object storage:
CREATE DATABASE cloud_db;
USE cloud_db;
CREATE TABLE IF NOT EXISTS detail_demo (
recruit_date DATE NOT NULL COMMENT "YYYY-MM-DD",
region_num TINYINT COMMENT "range [-128, 127]",
num_plate SMALLINT COMMENT "range [-32768, 32767] ",
tel INT COMMENT "range [-2147483648, 2147483647]",
id BIGINT COMMENT "range [-2^63 + 1 ~ 2^63 - 1]",
password LARGEINT COMMENT "range [-2^127 + 1 ~ 2^127 - 1]",
name CHAR(20) NOT NULL COMMENT "range char(m),m in (1-255) ",
profile VARCHAR(500) NOT NULL COMMENT "upper limit value 65533 bytes",
ispass BOOLEAN COMMENT "true/false")
DUPLICATE KEY(recruit_date, region_num)
DISTRIBUTED BY HASH(recruit_date, region_num)
PROPERTIES (
"storage_volume" = "def_volume",
"datacache.enable" = "true",
"datacache.partition_duration" = "1 MONTH"
);
NOTE
The default storage volume is used when you create a database or a cloud-native table in a shared-data StarRocks cluster if no storage volume is specified.
In addition to the regular table PROPERTIES
, you need to specify the following PROPERTIES
when creating a table for shared-data StarRocks cluster:
datacache.enable
Whether to enable the local disk cache.
true
(Default) When this property is set totrue
, the data to be loaded is simultaneously written into the object storage and the local disk (as the cache for query acceleration).false
When this property is set tofalse
, the data is loaded only into the object storage.
NOTE
In version 3.0 this property was named
enable_storage_cache
.To enable the local disk cache, you must specify the directory of the disk in the CN configuration item
storage_root_path
.
datacache.partition_duration
The validity duration of the hot data. When the local disk cache is enabled, all data is loaded into the cache. When the cache is full, StarRocks deletes the less recently used data from the cache. When a query needs to scan the deleted data, StarRocks checks if the data is within the duration of validity starting from the current time. If the data is within the duration, StarRocks loads the data into the cache again. If the data is not within the duration, StarRocks does not load it into the cache. This property is a string value that can be specified with the following units: YEAR
, MONTH
, DAY
, and HOUR
, for example, 7 DAY
and 12 HOUR
. If it is not specified, all data is cached as the hot data.
NOTE
In version 3.0 this property was named
storage_cache_ttl
.This property is available only when
datacache.enable
is set totrue
.
View table information
You can view the information of tables in a specific database using SHOW PROC "/dbs/<db_id>"
. See SHOW PROC for more information.
Example:
mysql> SHOW PROC "/dbs/xxxxx";
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
| TableId | TableName | IndexNum | PartitionColumnName | PartitionNum | State | Type | LastConsistencyCheckTime | ReplicaCount | PartitionType | StoragePath |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
| 12003 | detail_demo | 1 | NULL | 1 | NORMAL | CLOUD_NATIVE | NULL | 8 | UNPARTITIONED | s3://xxxxxxxxxxxxxx/1/12003/ |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
The Type
of a table in shared-data StarRocks cluster is CLOUD_NATIVE
. In the field StoragePath
, StarRocks returns the object storage directory where the table is stored.
Load data into a shared-data StarRocks cluster
Shared-data StarRocks clusters support all loading methods provided by StarRocks. See Loading options for more information.
Query in a shared-data StarRocks cluster
Tables in a shared-data StarRocks cluster support all types of queries provided by StarRocks. See StarRocks SELECT for more information.
NOTE
Shared-data StarRocks clusters support synchronous materialized views from v3.4.0.