# Load data from HDFS
StarRocks provides the following options for loading data from HDFS:
- Synchronous loading using INSERT+FILES()
- Asynchronous loading using Broker Load
- Continuous asynchronous loading using Pipe
Each of these options has its own advantages, which are detailed in the following sections.
In most cases, we recommend that you use the INSERT+FILES() method, which is much easier to use.
However, the INSERT+FILES() method currently supports only the Parquet, ORC, and CSV file formats. Therefore, if you need to load data of other file formats such as JSON, or perform data changes such as DELETE during data loading, you can resort to Broker Load.
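Broker Load is covered in detail later in this topic, but as a rough sketch, a job against the example Parquet file looks like the following. The database name `mydatabase`, the target table `user_behavior`, and the label are hypothetical placeholders, and the bracketed values are the simple-authentication details described under "Gather authentication details" below:

```SQL
LOAD LABEL mydatabase.label_user_behavior
(
    DATA INFILE("hdfs://<hdfs_ip>:<hdfs_port>/user/amber/user_behavior_ten_million_rows.parquet")
    INTO TABLE user_behavior
    FORMAT AS "parquet"
)
WITH BROKER
(
    "hadoop.security.authentication" = "simple",
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>"
);
```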
If you need to load a large number of data files with a significant data volume in total (for example, more than 100 GB or even 1 TB), we recommend that you use the Pipe method. Pipe can split the files based on their number or size, breaking down the load job into smaller, sequential tasks. This approach ensures that errors in one file do not impact the entire load job and minimizes the need for retries due to data errors.
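For illustration only, a Pipe job over the same path might look like the sketch below. The pipe name and target table are hypothetical, and the batching property values are assumptions chosen to illustrate splitting by file count and size; the exact properties and their defaults are described in the Pipe section later in this topic:

```SQL
CREATE PIPE mydatabase.user_behavior_pipe
PROPERTIES
(
    -- Assumed values for illustration: cut a task roughly every 256 files or 1 GB.
    "BATCH_FILES" = "256",
    "BATCH_SIZE" = "1GB"
)
AS
INSERT INTO user_behavior
SELECT * FROM FILES
(
    "path" = "hdfs://<hdfs_ip>:<hdfs_port>/user/amber/user_behavior_ten_million_rows.parquet",
    "format" = "parquet",
    "hadoop.security.authentication" = "simple",
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>"
);
```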
## Before you begin

### Make source data ready
Make sure the source data you want to load into StarRocks is properly stored in your HDFS cluster. This topic assumes that you want to load `/user/amber/user_behavior_ten_million_rows.parquet` from HDFS into StarRocks.
### Check privileges
You can load data into StarRocks tables only as a user who has the INSERT privilege on those StarRocks tables. If you do not have the INSERT privilege, follow the instructions provided in GRANT to grant the INSERT privilege to the user that you use to connect to your StarRocks cluster. The syntax is `GRANT INSERT ON TABLE <table_name> IN DATABASE <database_name> TO { ROLE <role_name> | USER <user_identity>}`.
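For example, to grant the INSERT privilege on a hypothetical table `user_behavior` in a database `mydatabase` to a user `jack` connecting from any host:

```SQL
GRANT INSERT ON TABLE user_behavior IN DATABASE mydatabase TO USER 'jack'@'%';
```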
### Gather authentication details
You can use the simple authentication method to establish connections with your HDFS cluster. To use simple authentication, you need to gather the username and password of the account that you can use to access the NameNode of the HDFS cluster.
## Use INSERT+FILES()
This method is available from v3.1 onwards and currently supports only the Parquet, ORC, and CSV (from v3.3.0 onwards) file formats.
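As a minimal sketch of the method, the following statement loads the example Parquet file into an existing table over simple authentication. The target table `user_behavior` and the bracketed connection and credential placeholders are assumptions for illustration:

```SQL
INSERT INTO user_behavior
SELECT * FROM FILES
(
    "path" = "hdfs://<hdfs_ip>:<hdfs_port>/user/amber/user_behavior_ten_million_rows.parquet",
    "format" = "parquet",
    "hadoop.security.authentication" = "simple",
    "username" = "<hdfs_username>",
    "password" = "<hdfs_password>"
);
```

FILES() can also infer the table schema from Parquet files, so you can create the target table directly with `CREATE TABLE ... AS SELECT * FROM FILES(...)` instead of defining it by hand.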