Version: 3.1

FILES

Description

Reads a data file from cloud storage. FILES() accesses cloud storage with the path-related properties of the file, infers the table schema of the data in the file, and returns the data rows. You can directly query the data rows using SELECT, load the data rows into an existing table using INSERT, or create a new table and load the data rows into it using CREATE TABLE AS SELECT. This feature is supported from v3.1.0 onwards.

Currently, the FILES() function supports the following data sources and file formats:

Data sources:
- AWS S3
File formats:
- Parquet
- ORC

Syntax

FILES( data_location , data_format [, StorageCredentialParams ] )

data_location ::=
    "path" = "s3://<s3_path>"

data_format ::=
    "format" = "{parquet | orc}"

Parameters

All parameters are in the "key" = "value" pairs.

Key	Required	Description
path	Yes	The URI used to access the file. Example: `s3://testbucket/parquet/test.parquet`.
format	Yes	The format of the data file. Valid values: `parquet` and `orc`.

StorageCredentialParams

The authentication information used by StarRocks to access your storage system.

Use the IAM user-based authentication to access AWS S3:

"aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
"aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
"aws.s3.region" = "<s3_region>"

Key	Required	Description
aws.s3.access_key	Yes	The Access Key ID that you can use to access the Amazon S3 bucket.
aws.s3.secret_key	Yes	The Secret Access Key that you can use to access the Amazon S3 bucket.
aws.s3.region	Yes	The region in which your AWS S3 bucket resides. Example: `us-west-2`.

Return

When used with SELECT, FILES() returns the data in the file as a table.

When querying Parquet or ORC files, you can directly specify the name of the desired columns in the SELECT statement, or specify * to obtain data from all columns.

SELECT * FROM FILES(
    "path" = "s3://inserttest/parquet/file2.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
    "aws.s3.region" = "us-west-2"
)
WHERE c1 IN (101,105);
+------+------+---------------------+
| c1   | c2   | c3                  |
+------+------+---------------------+
|  101 |    9 | 2018-05-15T18:30:00 |
|  105 |    6 | 2018-05-15T18:30:00 |
+------+------+---------------------+
2 rows in set (0.29 sec)

SELECT c1, c3 FROM FILES(
    "path" = "s3://inserttest/parquet/file2.parquet",
    "format" = "parquet",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
    "aws.s3.region" = "us-west-2"
);
+------+---------------------+
| c1   | c3                  |
+------+---------------------+
|  101 | 2018-05-15T18:30:00 |
|  102 | 2018-05-15T18:30:00 |
|  103 | 2018-05-15T18:30:00 |
|  104 | 2018-05-15T18:30:00 |
|  105 | 2018-05-15T18:30:00 |
|  106 | 2018-05-15T18:30:00 |
|  107 | 2018-05-15T18:30:00 |
|  108 | 2018-05-15T18:30:00 |
|  109 | 2018-05-15T18:30:00 |
|  110 | 2018-05-15T18:30:00 |
+------+---------------------+
10 rows in set (0.55 sec)

Examples

Example 1: Query the data from the Parquet file parquet/par-dup.parquet within the AWS S3 bucket inserttest:

MySQL > SELECT * FROM FILES(
     "path" = "s3://inserttest/parquet/par-dup.parquet",
     "format" = "parquet",
     "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
     "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
     "aws.s3.region" = "us-west-2"
);
+------+---------------------------------------------------------+
| c1   | c2                                                      |
+------+---------------------------------------------------------+
|    1 | {"1": "key", "1": "1", "111": "1111", "111": "aaaa"}    |
|    2 | {"2": "key", "2": "NULL", "222": "2222", "222": "bbbb"} |
+------+---------------------------------------------------------+
2 rows in set (22.335 sec)

Example 2: Insert the data rows from the Parquet file parquet/insert_wiki_edit_append.parquet within the AWS S3 bucket inserttest into the table insert_wiki_edit:

MySQL > INSERT INTO insert_wiki_edit
    SELECT * FROM FILES(
        "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
        "format" = "parquet",
        "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
        "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
        "aws.s3.region" = "us-west-2"
);
Query OK, 2 rows affected (23.03 sec)
{'label':'insert_d8d4b2ee-ac5c-11ed-a2cf-4e1110a8f63b', 'status':'VISIBLE', 'txnId':'2440'}

Example 3: Create a table named ctas_wiki_edit and insert the data rows from the Parquet file parquet/insert_wiki_edit_append.parquet within the AWS S3 bucket inserttest into the table:

MySQL > CREATE TABLE ctas_wiki_edit AS
    SELECT * FROM FILES(
        "path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
        "format" = "parquet",
        "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
        "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB",
        "aws.s3.region" = "us-west-2"
);
Query OK, 2 rows affected (22.09 sec)
{'label':'insert_1a217d70-2f52-11ee-9e4a-7a563fb695da', 'status':'VISIBLE', 'txnId':'3248'}

Description​

Syntax​

Parameters​

StorageCredentialParams​

Return​

Examples​

What did you think of this doc?