Iceberg catalog
This example uses the Local Climatological Data (LCD) dataset featured in the StarRocks Basics Quick Start. You can load the data and try the example yourself.
An Iceberg catalog is a type of external catalog that is supported by StarRocks from v2.4 onwards. With Iceberg catalogs, you can:
- Directly query data stored in Iceberg without the need to manually create tables.
- Use INSERT INTO or asynchronous materialized views (which are supported from v2.5 onwards) to process data stored in Iceberg and load the data into StarRocks.
- Perform operations on StarRocks to create or drop Iceberg databases and tables, or sink data from StarRocks tables to Parquet-formatted Iceberg tables by using INSERT INTO (this feature is supported from v3.1 onwards).
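For example, once a catalog has been created (see Create an Iceberg catalog below), the first two workflows look like the following sketch. All object names here (iceberg_catalog, sales_db, orders, starrocks_db.orders_local) are illustrative placeholders:

```SQL
-- Query an Iceberg table in place, using catalog.database.table notation.
SELECT * FROM iceberg_catalog.sales_db.orders LIMIT 10;

-- Load Iceberg data into an existing StarRocks table with INSERT INTO.
INSERT INTO starrocks_db.orders_local
SELECT * FROM iceberg_catalog.sales_db.orders;
```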
To ensure successful SQL workloads on your Iceberg cluster, your StarRocks cluster must be able to access the storage system and metastore of your Iceberg cluster. StarRocks supports the following storage systems and metastores:
- Distributed file system (HDFS) or object storage like AWS S3, Microsoft Azure Storage, Google GCS, or another S3-compatible storage system (for example, MinIO)
- Metastore like Hive metastore, AWS Glue, or Tabular

Note the following:

- If you choose AWS S3 as storage, you can use HMS or AWS Glue as metastore. If you choose any other storage system, you can only use HMS as metastore.
- If you choose Tabular as metastore, you need to use the Iceberg REST catalog, as shown in the sketch below.
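For example, a Tabular deployment is accessed through the Iceberg REST catalog. The following sketch (the full syntax is described under Create an Iceberg catalog below) uses placeholder values for the endpoint, credential, and warehouse:

```SQL
CREATE EXTERNAL CATALOG tabular_catalog
PROPERTIES
(
    "type" = "iceberg",
    -- Tabular is reached through the Iceberg REST catalog.
    "iceberg.catalog.type" = "rest",
    -- Placeholder endpoint, credential, and warehouse; substitute your own.
    "iceberg.catalog.uri" = "https://api.tabular.io/ws",
    "iceberg.catalog.credential" = "<tabular_credential>",
    "iceberg.catalog.warehouse" = "<warehouse_name>"
);
```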
Usage notes
Take note of the following points when you use StarRocks to query data from Iceberg:
| File format | Compression format | Iceberg table version |
| ----------- | ------------------ | --------------------- |
| Parquet | SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION | |
| ORC | ZLIB, SNAPPY, LZO, LZ4, ZSTD, and NO_COMPRESSION | |
Integration preparation
Before you create an Iceberg catalog, make sure your StarRocks cluster can integrate with the storage system and metastore of your Iceberg cluster.
Storage
The instructions below cover both supported storage types: AWS S3 and HDFS.

AWS S3
If your Iceberg cluster uses AWS S3 as storage or AWS Glue as metastore, choose a suitable authentication method and make the required preparations to ensure that your StarRocks cluster can access the related AWS cloud resources.
The following authentication methods are recommended:
- Instance profile
- Assumed role
- IAM user
Of the above-mentioned three authentication methods, instance profile is the most widely used.
For more information, see Preparation for authentication in AWS IAM.
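For instance, with the instance profile method, the AWS-related properties of the catalog reduce to a few settings. A minimal sketch, assuming AWS Glue as metastore and S3 storage in a placeholder region:

```SQL
CREATE EXTERNAL CATALOG iceberg_catalog_glue
PROPERTIES
(
    "type" = "iceberg",
    -- AWS Glue as the metastore, authenticated via instance profile.
    "iceberg.catalog.type" = "glue",
    "aws.glue.use_instance_profile" = "true",
    "aws.glue.region" = "us-west-2",
    -- S3 as storage, also authenticated via instance profile.
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.region" = "us-west-2"
);
```

The catalog name and region are placeholders; replace them with your own values.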
HDFS

If you choose HDFS as storage, configure your StarRocks cluster as follows:

- (Optional) Set the username that is used to access your HDFS cluster and Hive metastore. By default, StarRocks uses the username of the FE and BE or CN processes to access your HDFS cluster and Hive metastore. You can also set the username by adding `export HADOOP_USER_NAME="<user_name>"` at the beginning of the fe/conf/hadoop_env.sh file of each FE and at the beginning of the be/conf/hadoop_env.sh file of each BE or the cn/conf/hadoop_env.sh file of each CN. After you set the username in these files, restart each FE and each BE or CN to make the parameter settings take effect. You can set only one username for each StarRocks cluster.
- When you query Iceberg data, the FEs and BEs or CNs of your StarRocks cluster use the HDFS client to access your HDFS cluster. In most cases, you do not need to configure your StarRocks cluster to achieve that purpose, and StarRocks starts the HDFS client using the default configurations. You need to configure your StarRocks cluster only in the following situations:
  - High availability (HA) is enabled for your HDFS cluster: Add the hdfs-site.xml file of your HDFS cluster to the $FE_HOME/conf path of each FE and to the $BE_HOME/conf path of each BE or the $CN_HOME/conf path of each CN.
  - View File System (ViewFs) is enabled for your HDFS cluster: Add the core-site.xml file of your HDFS cluster to the $FE_HOME/conf path of each FE and to the $BE_HOME/conf path of each BE or the $CN_HOME/conf path of each CN.

If an error indicating an unknown host is returned when you send a query, you must add the mapping between the host names and IP addresses of your HDFS cluster nodes to the /etc/hosts path.
Kerberos authentication
If Kerberos authentication is enabled for your HDFS cluster or Hive metastore, configure your StarRocks cluster as follows:
- Run the `kinit -kt keytab_path principal` command on each FE and each BE or CN to obtain a Ticket Granting Ticket (TGT) from the Key Distribution Center (KDC). To run this command, you must have the permissions to access your HDFS cluster and Hive metastore. Note that access to the KDC with this command is time-sensitive. Therefore, you need to use cron to run this command periodically.
- Add `JAVA_OPTS="-Djava.security.krb5.conf=/etc/krb5.conf"` to the $FE_HOME/conf/fe.conf file of each FE and to the $BE_HOME/conf/be.conf file of each BE or the $CN_HOME/conf/cn.conf file of each CN. In this example, `/etc/krb5.conf` is the save path of the krb5.conf file. You can modify the path based on your needs.
Create an Iceberg catalog
Syntax
```SQL
CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES
(
    "type" = "iceberg",
    [SecurityParams,]
    MetastoreParams,
    StorageCredentialParams,
    MetadataRelatedParams
)
```
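For instance, a minimal sketch of a catalog backed by a Hive metastore; the catalog name and thrift URI are placeholders:

```SQL
CREATE EXTERNAL CATALOG iceberg_catalog_hms
COMMENT "Iceberg catalog backed by a Hive metastore"
PROPERTIES
(
    "type" = "iceberg",
    -- MetastoreParams for a Hive metastore; replace the URI with your own.
    "iceberg.catalog.type" = "hive",
    "hive.metastore.uris" = "thrift://xx.xx.xx.xx:9083"
);
```

After creation, you can verify that the catalog exists with SHOW CATALOGS.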
Parameters
catalog_name
The name of the Iceberg catalog. The naming conventions are as follows:
- The name can contain letters, digits (0-9), and underscores (_). It must start with a letter.
- The name is case-sensitive and cannot exceed 1023 characters in length.
comment
The description of the Iceberg catalog. This parameter is optional.