# Back up and restore data
This topic describes how to back up and restore data in StarRocks, or migrate data to a new StarRocks cluster.
StarRocks supports backing up data as snapshots into a remote storage system, and restoring the data to any StarRocks cluster.
StarRocks supports the following remote storage systems:
- Apache™ Hadoop® (HDFS) cluster
- AWS S3
- Google GCS
> **NOTE**
>
> Shared-data StarRocks clusters do not support data BACKUP and RESTORE.
## Back up data
StarRocks supports FULL backup at the granularity of database, table, or partition.
If you have stored a large amount of data in a table, we recommend that you back up and restore data by partition. This way, you can reduce the cost of retries in case of job failures. If you need to back up incremental data on a regular basis, you can strategize a dynamic partitioning plan (by a certain time interval, for example) for your table, and back up only new partitions each time.
### Create a repository
Before backing up data, you need to create a repository, which is used to store data snapshots in a remote storage system. You can create multiple repositories in a StarRocks cluster. For detailed instructions, see CREATE REPOSITORY.
- Create a repository in HDFS

  The following example creates a repository named `test_repo` in an HDFS cluster.

  ```SQL
  CREATE REPOSITORY test_repo
  WITH BROKER
  ON LOCATION "hdfs://<hdfs_host>:<hdfs_port>/repo_dir/backup"
  PROPERTIES(
      "username" = "<hdfs_username>",
      "password" = "<hdfs_password>"
  );
  ```
- Create a repository in AWS S3

  You can choose IAM user-based credentials (Access Key and Secret Key), Instance Profile, or Assumed Role as the credential method for accessing AWS S3.

  - The following example creates a repository named `test_repo` in the AWS S3 bucket `bucket_s3` using IAM user-based credentials as the credential method.

    ```SQL
    CREATE REPOSITORY test_repo
    WITH BROKER
    ON LOCATION "s3a://bucket_s3/backup"
    PROPERTIES(
        "aws.s3.access_key" = "XXXXXXXXXXXXXXXXX",
        "aws.s3.secret_key" = "yyyyyyyyyyyyyyyyyyyyyyyy",
        "aws.s3.endpoint" = "s3.us-east-1.amazonaws.com"
    );
    ```

  - The following example creates a repository named `test_repo` in the AWS S3 bucket `bucket_s3` using Instance Profile as the credential method.

    ```SQL
    CREATE REPOSITORY test_repo
    WITH BROKER
    ON LOCATION "s3a://bucket_s3/backup"
    PROPERTIES(
        "aws.s3.use_instance_profile" = "true",
        "aws.s3.region" = "us-east-1"
    );
    ```

  - The following example creates a repository named `test_repo` in the AWS S3 bucket `bucket_s3` using Assumed Role as the credential method.

    ```SQL
    CREATE REPOSITORY test_repo
    WITH BROKER
    ON LOCATION "s3a://bucket_s3/backup"
    PROPERTIES(
        "aws.s3.use_instance_profile" = "true",
        "aws.s3.iam_role_arn" = "arn:aws:iam::xxxxxxxxxx:role/yyyyyyyy",
        "aws.s3.region" = "us-east-1"
    );
    ```

  > **NOTE**
  >
  > StarRocks supports creating repositories in AWS S3 only according to the S3A protocol. Therefore, when you create repositories in AWS S3, you must replace `s3://` with `s3a://` in the S3 URI you pass as a repository location in `ON LOCATION`.
- Create a repository in Google GCS

  The following example creates a repository named `test_repo` in the Google GCS bucket `bucket_gcs`.

  ```SQL
  CREATE REPOSITORY test_repo
  WITH BROKER
  ON LOCATION "s3a://bucket_gcs/backup"
  PROPERTIES(
      "fs.s3a.access.key" = "xxxxxxxxxxxxxxxxxxxx",
      "fs.s3a.secret.key" = "yyyyyyyyyyyyyyyyyyyy",
      "fs.s3a.endpoint" = "storage.googleapis.com"
  );
  ```

  > **NOTE**
  >
  > - StarRocks supports creating repositories in Google GCS only according to the S3A protocol. Therefore, when you create repositories in Google GCS, you must replace the prefix in the GCS URI you pass as a repository location in `ON LOCATION` with `s3a://`.
  > - Do not specify `https` in the endpoint address.
After the repository is created, you can check the repository via SHOW REPOSITORIES. After restoring data, you can delete the repository in StarRocks using DROP REPOSITORY. However, data snapshots backed up in the remote storage system cannot be deleted through StarRocks. You need to delete them manually in the remote storage system.
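For example, assuming the `test_repo` repository created above, you can list the repositories in the cluster and later drop the one you no longer need:

```SQL
-- List all repositories created in the current cluster.
SHOW REPOSITORIES;

-- Delete the repository from StarRocks. Note that the snapshots
-- already backed up in the remote storage system are NOT deleted
-- by this statement.
DROP REPOSITORY test_repo;
```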
### Back up a data snapshot
After the repository is created, you need to create a data snapshot and back it up in the remote repository. For detailed instructions, see BACKUP.
The following example creates a data snapshot `sr_member_backup` for the table `sr_member` in the database `sr_hub` and backs it up in the repository `test_repo`.

```SQL
BACKUP SNAPSHOT sr_hub.sr_member_backup
TO test_repo
ON (sr_member);
```
StarRocks supports BACKUP and RESTORE operations on the following levels of granularity:

- Partition level: You need to specify the ON clause in the format `ON (<table_name> PARTITION (<partition_name>, ...))`.
- Table level: You need to specify the ON clause in the format `ON (<table_name>)`.
- Database level: You do not need to specify the ON clause. This will back up or restore the entire database.
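The partition-level granularity described above can be sketched as follows. This example backs up a hypothetical partition `p20230201` of the table `sr_member` (the partition name and snapshot name are assumptions for illustration):

```SQL
-- Back up only one partition of the table, which keeps retries
-- cheap if the job fails.
BACKUP SNAPSHOT sr_hub.sr_member_backup_p20230201
TO test_repo
ON (sr_member PARTITION (p20230201));
```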
BACKUP is an asynchronous operation. You can check the status of a BACKUP job using SHOW BACKUP, or cancel a BACKUP job using CANCEL BACKUP.
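For example, to check the status of the BACKUP job running in the database `sr_hub`, or cancel it:

```SQL
-- Check the status of the most recent BACKUP job in sr_hub.
SHOW BACKUP FROM sr_hub;

-- Cancel the ongoing BACKUP job in sr_hub.
CANCEL BACKUP FROM sr_hub;
```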
## Restore or migrate data
You can restore the data snapshot backed up in the remote storage system to the current or other StarRocks clusters to restore or migrate data.
### (Optional) Create a repository in the new cluster
To migrate data to another StarRocks cluster, you need to create a repository with the same repository name and location in the new cluster, otherwise you will not be able to view the previously backed up data snapshots. See Create a repository for details.
### Check the snapshot
Before restoring data, you can check the snapshots in a specified repository using SHOW SNAPSHOT.
The following example checks the snapshot information in `test_repo`.

```Plain
mysql> SHOW SNAPSHOT ON test_repo;
+------------------+-------------------------+--------+
| Snapshot         | Timestamp               | Status |
+------------------+-------------------------+--------+
| sr_member_backup | 2023-02-07-14-45-53-143 | OK     |
+------------------+-------------------------+--------+
1 row in set (1.16 sec)
```
### Restore data via the snapshot
You can use the RESTORE statement to restore data snapshots in the remote storage system to the current or other StarRocks clusters.
The following example restores the data snapshot `sr_member_backup` in `test_repo` to the table `sr_member`. It restores only ONE data replica.

```SQL
RESTORE SNAPSHOT sr_hub.sr_member_backup
FROM test_repo
ON (sr_member)
PROPERTIES (
    "backup_timestamp"="2023-02-07-14-45-53-143",
    "replication_num" = "1"
);
```
StarRocks supports BACKUP and RESTORE operations on the following levels of granularity:

- Partition level: You need to specify the ON clause in the format `ON (<table_name> PARTITION (<partition_name>, ...))`.
- Table level: You need to specify the ON clause in the format `ON (<table_name>)`.
- Database level: You do not need to specify the ON clause. This will back up or restore the entire database.
RESTORE is an asynchronous operation. You can check the status of a RESTORE job using SHOW RESTORE, or cancel a RESTORE job using CANCEL RESTORE.
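For example, to check the status of the RESTORE job running in the database `sr_hub`, or cancel it:

```SQL
-- Check the status of the most recent RESTORE job in sr_hub.
SHOW RESTORE FROM sr_hub;

-- Cancel the ongoing RESTORE job in sr_hub.
CANCEL RESTORE FROM sr_hub;
```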
## Configure BACKUP or RESTORE jobs

You can optimize the performance of BACKUP or RESTORE jobs by modifying the following configuration items in the BE configuration file `be.conf`:
| Configuration item | Description |
| --- | --- |
| `make_snapshot_worker_count` | The maximum number of threads for the make snapshot tasks of BACKUP jobs on a BE node. Default: `5`. Increase the value of this configuration item to increase the concurrency of the make snapshot task. |
| `release_snapshot_worker_count` | The maximum number of threads for the release snapshot tasks of failed BACKUP jobs on a BE node. Default: `5`. Increase the value of this configuration item to increase the concurrency of the release snapshot task. |
| `upload_worker_count` | The maximum number of threads for the upload tasks of BACKUP jobs on a BE node. Default: `0`. `0` indicates setting the value to the number of CPU cores on the machine where the BE resides. Increase the value of this configuration item to increase the concurrency of the upload task. |
| `download_worker_count` | The maximum number of threads for the download tasks of RESTORE jobs on a BE node. Default: `0`. `0` indicates setting the value to the number of CPU cores on the machine where the BE resides. Increase the value of this configuration item to increase the concurrency of the download task. |
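As a sketch, on a BE node where upload or download tasks are the bottleneck, you could raise the concurrency by adding lines like the following to `be.conf` and restarting the BE. The values here are illustrative assumptions, not tuned recommendations:

```Plain
# Override the CPU-core-based default with explicit thread counts.
upload_worker_count = 32
download_worker_count = 32
```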
## Usage notes
- Performing backup and restore operations on global, database, table, and partition levels requires different privileges. For detailed information, see Customize roles based on scenarios.
- In each database, only one running BACKUP or RESTORE job is allowed at a time. Otherwise, StarRocks returns an error.
- Because BACKUP and RESTORE jobs occupy many resources of your StarRocks cluster, we recommend that you back up and restore your data when your StarRocks cluster is not heavily loaded.
- StarRocks does not support specifying a data compression algorithm for data backup.
- Because data is backed up as snapshots, the data loaded upon snapshot generation is not included in the snapshot. Therefore, if you load data into the old cluster after the snapshot is generated and before the RESTORE job is completed, you also need to load the data into the cluster that data is restored into. It is recommended that you load data into both clusters in parallel for a period of time after the data migration is complete, and then migrate your application to the new cluster after verifying the correctness of the data and services.
- Before the RESTORE job is completed, you cannot operate the table to be restored.
- Primary Key tables cannot be restored to a StarRocks cluster earlier than v2.5.
- You do not need to create the table to be restored in the new cluster before restoring it. The RESTORE job automatically creates it.
- If there is an existing table whose name duplicates that of the table to be restored, StarRocks first checks whether the schema of the existing table matches that of the table to be restored. If the schemas match, StarRocks overwrites the existing table with the data in the snapshot. If the schemas do not match, the RESTORE job fails. You can either rename the table to be restored using the keyword `AS`, or delete the existing table before restoring data.
- If the RESTORE job overwrites an existing database, table, or partition, the overwritten data cannot be restored after the job enters the COMMIT phase. If the RESTORE job fails or is canceled at this point, the data may be corrupted and inaccessible. In this case, you can only perform the RESTORE operation again and wait for the job to complete. Therefore, we recommend that you do not restore data by overwriting unless you are sure that the current data is no longer used. The overwrite operation first checks metadata consistency between the snapshot and the existing database, table, or partition. If an inconsistency is detected, the RESTORE operation cannot be performed.
- During a BACKUP or RESTORE job, StarRocks automatically backs up or restores synchronous materialized views, which can still accelerate or rewrite your queries after the data is restored. Currently, StarRocks does not support backing up views or asynchronous materialized views. You can only back up the physical table of an asynchronous materialized view, which cannot then be used for query acceleration or query rewriting.
- Currently, StarRocks does not support backing up the configuration data related to user accounts, privileges, and resource groups.
- Currently, StarRocks does not support backing up and restoring the Colocate Join relationship among tables.
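The rename workaround with the `AS` keyword mentioned in the notes above can be sketched as follows. This restores the snapshot into a table named `sr_member_new` (the new table name is a hypothetical example), so an existing `sr_member` table is left untouched:

```SQL
-- Restore the backed-up sr_member table under a new name to avoid
-- overwriting a table that already exists in the target cluster.
RESTORE SNAPSHOT sr_hub.sr_member_backup
FROM test_repo
ON (sr_member AS sr_member_new)
PROPERTIES ("backup_timestamp" = "2023-02-07-14-45-53-143");
```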