File external table
A file external table is a special type of external table. It allows you to directly query Parquet and ORC data files in external storage systems without loading data into StarRocks. In addition, file external tables do not rely on a metastore. In the current version, StarRocks supports the following external storage systems: HDFS, Amazon S3, and other S3-compatible storage systems.
This feature is supported from StarRocks v2.5.
The file external table feature is designed to help load data into StarRocks, not to serve queries against external systems as a routine operation. Although you can run queries, including JOINs, against file external tables, performance is limited. For better performance, load the data into StarRocks.
Limits
- File external tables must be created in databases within the default_catalog. You can run SHOW CATALOGS to query catalogs created in the cluster.
- Only Parquet and ORC data files are supported.
- You can only use file external tables to query data in the target data file. Data write operations such as INSERT, DELETE, and DROP are not supported.
Prerequisites
Before you create a file external table, you must configure your StarRocks cluster so that StarRocks can access the external storage system where the target data file is stored. The configurations required for a file external table are the same as those required for a Hive catalog, except that you do not need to configure a metastore. See Hive catalog - Integration preparations for more information about configurations.
Create a database (Optional)
After connecting to your StarRocks cluster, you can create a file external table in an existing database or create a new database to manage file external tables. To query existing databases in the cluster, run SHOW DATABASES. Then you can run USE <db_name> to switch to the target database.
The syntax for creating a database is as follows.
```sql
CREATE DATABASE [IF NOT EXISTS] <db_name>
```
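For example, the following statements create a database named db_example (the name used in the examples later in this topic) and switch to it:

```sql
-- Create the database if it does not already exist, then switch to it.
CREATE DATABASE IF NOT EXISTS db_example;
USE db_example;
```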
Create a file external table
After accessing the target database, you can create a file external table in this database.
Syntax
```sql
CREATE EXTERNAL TABLE <table_name>
(
    <col_name> <col_type> [NULL | NOT NULL] [COMMENT "<comment>"]
)
ENGINE=file
[COMMENT "<comment>"]
PROPERTIES
(
    FileLayoutParams,
    StorageCredentialParams
)
```
Parameters
Parameter | Required | Description |
---|---|---|
table_name | Yes | The name of the file external table. The name must comply with the naming conventions of StarRocks. |
col_name | Yes | The column name in the file external table. The column names in the file external table must be the same as those in the target data file but are not case-sensitive. The order of columns in the file external table can be different from that in the target data file. |
col_type | Yes | The column type in the file external table. You need to specify this parameter based on the column type in the target data file. For more information, see Mapping of column types. |
NULL \| NOT NULL | No | Whether the column in the file external table is allowed to be NULL. |
comment | No | The comment of the column in the file external table. |
ENGINE | Yes | The type of engine. Set the value to file. |
comment | No | The description of the file external table. |
PROPERTIES | Yes | The properties of the file external table, consisting of FileLayoutParams for accessing the target data file and StorageCredentialParams for accessing the target storage system. |
FileLayoutParams
A set of parameters for accessing the target data file.
```sql
"path" = "<file_path>",
"format" = "<file_format>",
"enable_recursive_listing" = "{ true | false }",
"enable_wildcards" = "{ true | false }"
```
Parameter | Required | Description |
---|---|---|
path | Yes | The path of the data file. |
format | Yes | The format of the data file. Valid values: `parquet` and `orc`. |
enable_recursive_listing | No | Specifies whether to recursively traverse all files under the current path. Default value: `true`. The value `true` specifies to recursively list subdirectories, and the value `false` specifies to ignore subdirectories. |
enable_wildcards | No | Whether to support using wildcards (`*`) in `path`. Default value: `false`. For example, `2024-07-*` matches all files with the `2024-07-` prefix. This parameter is supported from v3.1.9. |
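As a sketch of how `enable_wildcards` might be used (the bucket name and path are hypothetical), the following statement matches only the Parquet files whose names start with `2024-07-` under the target folder:

```sql
CREATE EXTERNAL TABLE table_wildcard
(
    name string,
    id int
)
ENGINE=file
PROPERTIES
(
    -- Hypothetical path; "2024-07-*" matches files with the "2024-07-" prefix.
    "path" = "s3://bucket-test/folder1/2024-07-*",
    "format" = "parquet",
    "enable_wildcards" = "true",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.region" = "us-west-2"
);
```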
StorageCredentialParams (Optional)
A set of parameters about how StarRocks integrates with the target storage system. This parameter set is optional. You need to configure StorageCredentialParams only when the target storage system is AWS S3 or other S3-compatible storage. For other storage systems, you can ignore StorageCredentialParams.
AWS S3
If you need to access a data file stored in AWS S3, configure the following authentication parameters in StorageCredentialParams.
- If you choose the instance profile-based authentication method, configure StorageCredentialParams as follows:

```sql
"aws.s3.use_instance_profile" = "true",
"aws.s3.region" = "<aws_s3_region>"
```

- If you choose the assumed role-based authentication method, configure StorageCredentialParams as follows:

```sql
"aws.s3.use_instance_profile" = "true",
"aws.s3.iam_role_arn" = "<ARN of your assumed role>",
"aws.s3.region" = "<aws_s3_region>"
```

- If you choose the IAM user-based authentication method, configure StorageCredentialParams as follows:

```sql
"aws.s3.use_instance_profile" = "false",
"aws.s3.access_key" = "<iam_user_access_key>",
"aws.s3.secret_key" = "<iam_user_secret_key>",
"aws.s3.region" = "<aws_s3_region>"
```
Parameter name | Required | Description |
---|---|---|
aws.s3.use_instance_profile | Yes | Specifies whether to enable the instance profile-based authentication method and the assumed role-based authentication method when you access AWS S3. Valid values: `true` and `false`. Default value: `false`. |
aws.s3.iam_role_arn | Yes | The ARN of the IAM role that has privileges on your AWS S3 bucket. If you use the assumed role-based authentication method to access AWS S3, you must specify this parameter. Then, StarRocks will assume this role when it accesses the target data file. |
aws.s3.region | Yes | The region in which your AWS S3 bucket resides. Example: `us-west-1`. |
aws.s3.access_key | No | The access key of your IAM user. If you use the IAM user-based authentication method to access AWS S3, you must specify this parameter. |
aws.s3.secret_key | No | The secret key of your IAM user. If you use the IAM user-based authentication method to access AWS S3, you must specify this parameter. |
For information about how to choose an authentication method for accessing AWS S3 and how to configure an access control policy in the AWS IAM Console, see Authentication parameters for accessing AWS S3.
S3-compatible storage
If you need to access an S3-compatible storage system, such as MinIO, configure StorageCredentialParams as follows to ensure a successful integration:

```sql
"aws.s3.enable_ssl" = "false",
"aws.s3.enable_path_style_access" = "true",
"aws.s3.endpoint" = "<s3_endpoint>",
"aws.s3.access_key" = "<iam_user_access_key>",
"aws.s3.secret_key" = "<iam_user_secret_key>"
```

The following table describes the parameters you need to configure in StorageCredentialParams.
Parameter | Required | Description |
---|---|---|
aws.s3.enable_ssl | Yes | Specifies whether to enable SSL connection. Valid values: `true` and `false`. Default value: `true`. |
aws.s3.enable_path_style_access | Yes | Specifies whether to enable path-style access. Valid values: `true` and `false`. Default value: `false`. For MinIO, you must set the value to `true`. Path-style URLs use the following format: `https://s3.<region_code>.amazonaws.com/<bucket_name>/<key_name>`. For example, if you create a bucket named `DOC-EXAMPLE-BUCKET1` in the US West (Oregon) Region, and you want to access the `alice.jpg` object in that bucket, you can use the following path-style URL: `https://s3.us-west-2.amazonaws.com/DOC-EXAMPLE-BUCKET1/alice.jpg`. |
aws.s3.endpoint | Yes | The endpoint used to connect to an S3-compatible storage system instead of AWS S3. |
aws.s3.access_key | Yes | The access key of your IAM user. |
aws.s3.secret_key | Yes | The secret key of your IAM user. |
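Putting these parameters together, a file external table pointing at a MinIO deployment might be configured as follows (the endpoint, bucket, and credentials are placeholders for illustration):

```sql
CREATE EXTERNAL TABLE table_minio
(
    name string,
    id int
)
ENGINE=file
PROPERTIES
(
    "path" = "s3://bucket-minio/folder1/",
    "format" = "parquet",
    "aws.s3.enable_ssl" = "false",
    -- Path-style access is required for MinIO.
    "aws.s3.enable_path_style_access" = "true",
    "aws.s3.endpoint" = "http://x.x.x.x:9000",
    "aws.s3.access_key" = "<iam_user_access_key>",
    "aws.s3.secret_key" = "<iam_user_secret_key>"
);
```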
Mapping of column types
The following table provides the mapping of column types between the target data file and the file external table.
Data file | File external table |
---|---|
INT/INTEGER | INT |
BIGINT | BIGINT |
TIMESTAMP | DATETIME. Note that TIMESTAMP is converted to DATETIME without a time zone based on the time zone setting of the current session and loses some of its precision. |
STRING | STRING |
VARCHAR | VARCHAR |
CHAR | CHAR |
DOUBLE | DOUBLE |
FLOAT | FLOAT |
DECIMAL | DECIMAL |
BOOLEAN | BOOLEAN |
ARRAY | ARRAY |
MAP | MAP |
STRUCT | STRUCT |
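For instance, following the mapping above, a Parquet file containing DECIMAL, ARRAY, and TIMESTAMP columns could be declared as follows (the column names, precision, and path are illustrative, not prescribed by StarRocks):

```sql
CREATE EXTERNAL TABLE table_types
(
    id BIGINT,
    price DECIMAL(10, 2),   -- Parquet DECIMAL maps to DECIMAL
    tags ARRAY<STRING>,     -- Parquet ARRAY maps to ARRAY
    created_at DATETIME     -- Parquet TIMESTAMP maps to DATETIME
)
ENGINE=file
PROPERTIES
(
    "path" = "hdfs://x.x.x.x:8020/user/hive/warehouse/items_parq/",
    "format" = "parquet"
);
```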
Examples
HDFS
Create a file external table named t0 to query Parquet data files stored in an HDFS path.

```sql
USE db_example;

CREATE EXTERNAL TABLE t0
(
    name string,
    id int
)
ENGINE=file
PROPERTIES
(
    "path"="hdfs://x.x.x.x:8020/user/hive/warehouse/person_parq/",
    "format"="parquet"
);
```
AWS S3
Example 1: Create a file external table and use instance profile to access a single Parquet file in AWS S3.
```sql
USE db_example;

CREATE EXTERNAL TABLE table_1
(
    name string,
    id int
)
ENGINE=file
PROPERTIES
(
    "path" = "s3://bucket-test/folder1/raw_0.parquet",
    "format" = "parquet",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.region" = "us-west-2"
);
```
Example 2: Create a file external table and use assumed role to access all the ORC files under the target file path in AWS S3.
```sql
USE db_example;

CREATE EXTERNAL TABLE table_1
(
    name string,
    id int
)
ENGINE=file
PROPERTIES
(
    "path" = "s3://bucket-test/folder1/",
    "format" = "orc",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.iam_role_arn" = "arn:aws:iam::51234343412:role/role_name_in_aws_iam",
    "aws.s3.region" = "us-west-2"
);
```
Example 3: Create a file external table and use IAM user to access all the ORC files under the file path in AWS S3.
```sql
USE db_example;

CREATE EXTERNAL TABLE table_1
(
    name string,
    id int
)
ENGINE=file
PROPERTIES
(
    "path" = "s3://bucket-test/folder1/",
    "format" = "orc",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "<iam_user_access_key>",
    "aws.s3.secret_key" = "<iam_user_secret_key>",
    "aws.s3.region" = "us-west-2"
);
```
Query a file external table
Syntax:
```sql
SELECT <clause> FROM <file_external_table>
```
For example, to query data from the file external table t0 created in Examples - HDFS, run the following command:

```sql
SELECT * FROM t0;

+--------+------+
| name   | id   |
+--------+------+
| jack   |    2 |
| lily   |    1 |
+--------+------+
2 rows in set (0.08 sec)
```
Manage file external tables
You can view the schema of the table using DESC or drop the table using DROP TABLE.
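For example, for the table t0 created in Examples - HDFS:

```sql
-- View the schema of the file external table.
DESC t0;

-- Drop the file external table when it is no longer needed.
DROP TABLE t0;
```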