Version: 3.1

Load data from Microsoft Azure Storage

StarRocks allows you to load data in bulk from Microsoft Azure Storage by using Broker Load.

Broker Load runs in asynchronous mode. An asynchronous Broker Load process handles making the connection to Azure, pulling the data, and storing the data in StarRocks.

Broker Load supports the Parquet, ORC, and CSV file formats.

Advantages of Broker Load

Broker Load runs in the background and clients do not need to stay connected for the job to continue.
Broker Load is preferred for long-running jobs, with the default timeout spanning 4 hours.
In addition to Parquet and ORC file formats, Broker Load supports CSV files.

Data flow

Workflow of Broker Load

The user creates a load job.
The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BEs) or compute nodes (CNs).
The BEs or CNs pull the data from the source and load the data into StarRocks.

Before you begin

Make source data ready

Make sure that the source data you want to load into StarRocks is properly stored in a container within your Azure storage account.

In this topic, suppose you want to load the data of a Parquet-formatted sample dataset (user_behavior_ten_million_rows.parquet) stored in the root directory of a container (starrocks-container) within an Azure Data Lake Storage Gen2 (ADLS Gen2) storage account (starrocks).

Check privileges

You can load data into StarRocks tables only as a user who has the INSERT privilege on those StarRocks tables. If you do not have the INSERT privilege, follow the instructions provided in GRANT to grant the INSERT privilege to the user that you use to connect to your StarRocks cluster.

Gather authentication details

The examples in this topic use the Shared Key authentication method. To ensure that you have permission to read data from ADLS Gen2, we recommend that you read Azure Data Lake Storage Gen2 > Shared Key (access key of storage account) to understand the authentication parameters that you need to configure.

In a nutshell, if you practice Shared Key authentication, you need to gather the following information:

The username of your ADLS Gen2 storage account
The shared key of your ADLS Gen2 storage account

For information about all the authentication methods available, see Authenticate to Azure cloud storage.

Typical example

Create a table, start a load process that pulls the sample dataset user_behavior_ten_million_rows.parquet from Azure, and verify the progress and success of the data loading.

Create a database and a table

Connect to your StarRocks cluster. Then, create a database and switch to it:

CREATE DATABASE IF NOT EXISTS mydatabase;
USE mydatabase;

Create a table by hand (we recommend that the table have the same schema as the Parquet file you want to load from Azure):

CREATE TABLE user_behavior
(
    UserID int(11),
    ItemID int(11),
    CategoryID int(11),
    BehaviorType varchar(65533),
    Timestamp varbinary
)
ENGINE = OLAP 
DUPLICATE KEY(UserID)
DISTRIBUTED BY HASH(UserID);

Start a Broker Load

Run the following command to start a Broker Load job that loads data from the sample dataset user_behavior_ten_million_rows.parquet to the user_behavior table:

LOAD LABEL user_behavior
(
    DATA INFILE("abfss://starrocks-container@starrocks.dfs.core.windows.net/user_behavior_ten_million_rows.parquet")
    INTO TABLE user_behavior
    FORMAT AS "parquet"
)
WITH BROKER
(
    "azure.adls2.storage_account" = "starrocks",
    "azure.adls2.shared_key" = "xxxxxxxxxxxxxxxxxx"
)
PROPERTIES
(
    "timeout" = "3600"
);

This job has four main sections:

LABEL: A string used when querying the state of the load job.
LOAD declaration: The source URI, source data format, and destination table name.
BROKER: The connection details for the source.
PROPERTIES: The timeout value and any other properties to apply to the load job.

For detailed syntax and parameter descriptions, see BROKER LOAD.

Check load progress

You can query the progress of Broker Load jobs from the loads view in the StarRocks Information Schema. This feature is supported from v3.1 onwards.

SELECT * FROM information_schema.loads \G

If you have submitted multiple load jobs, you can filter on the LABEL associated with the job:

SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior' \G
*************************** 1. row ***************************
              JOB_ID: 10250
               LABEL: user_behavior
       DATABASE_NAME: mydatabase
               STATE: FINISHED
            PROGRESS: ETL:100%; LOAD:100%
                TYPE: BROKER
            PRIORITY: NORMAL
           SCAN_ROWS: 10000000
       FILTERED_ROWS: 0
     UNSELECTED_ROWS: 0
           SINK_ROWS: 10000000
            ETL_INFO:
           TASK_INFO: resource:N/A; timeout(s):3600; max_filter_ratio:0.0
         CREATE_TIME: 2023-12-28 16:15:19
      ETL_START_TIME: 2023-12-28 16:15:25
     ETL_FINISH_TIME: 2023-12-28 16:15:25
     LOAD_START_TIME: 2023-12-28 16:15:25
    LOAD_FINISH_TIME: 2023-12-28 16:16:31
         JOB_DETAILS: {"All backends":{"6a8ef4c0-1009-48c9-8d18-c4061d2255bf":[10121]},"FileNumber":1,"FileSize":132251298,"InternalTableLoadBytes":311710786,"InternalTableLoadRows":10000000,"ScanBytes":132251298,"ScanRows":10000000,"TaskNumber":1,"Unfinished backends":{"6a8ef4c0-1009-48c9-8d18-c4061d2255bf":[]}}
           ERROR_MSG: NULL
        TRACKING_URL: NULL
        TRACKING_SQL: NULL
REJECTED_RECORD_PATH: NULL

For information about the fields provided in the loads view, see see Information Schema.

After you confirm that the load job has finished, you can check a subset of the destination table to see if the data has been successfully loaded. Example:

SELECT * from user_behavior LIMIT 3;

If data can be successfully returned, the data loading succeeds.

The system returns a query result similar to the following, indicating that the data has been successfully loaded:

+--------+---------+------------+--------------+---------------------+
| UserID | ItemID  | CategoryID | BehaviorType | Timestamp           |
+--------+---------+------------+--------------+---------------------+
|    142 | 2869980 |    2939262 | pv           | 2017-11-25 03:43:22 |
|    142 | 2522236 |    1669167 | pv           | 2017-11-25 15:14:12 |
|    142 | 3031639 |    3607361 | pv           | 2017-11-25 15:19:25 |
+--------+---------+------------+--------------+---------------------+

Advantages of Broker Load​

Data flow​

Before you begin​

Make source data ready​

Check privileges​

Gather authentication details​

Typical example​

Create a database and a table​

Start a Broker Load​

Check load progress​

What did you think of this doc?