Feature Support: Data Lake Analytics
From v2.3 onwards, StarRocks supports managing external data sources and analyzing data in data lakes via external catalogs.
This document lists the features supported for external catalogs and the versions from which each feature is available.
Universal features
This section lists the features common to all external catalogs, including storage systems, file readers, credentials, privileges, and Data Cache.
External storage systems
| Storage System | Supported Version | 
|---|---|
| HDFS | v2.3+ | 
| AWS S3 | v2.3+ | 
| Microsoft Azure Storage | v3.0+ | 
| Google GCS | v3.0+ | 
| Alibaba Cloud OSS | v3.1+ | 
| Huawei Cloud OBS | v3.1+ | 
| Tencent Cloud COS | v3.1+ | 
| Volcengine TOS | v3.1+ | 
| Kingsoft Cloud KS3 | v3.1+ | 
| MinIO | v3.1+ | 
| Ceph S3 | v3.1+ | 
In addition to the native support for the storage systems listed above, StarRocks also supports the following types of object storage services:
- HDFS-compatible object storage services such as COS Cloud HDFS, OSS-HDFS, and OBS PFS
  - Description: You need to specify the object storage URI prefix in the BE configuration item fallback_to_hadoop_fs_list, and upload the .jar package provided by the cloud vendor to the directory /lib/hadoop/hdfs/. Note that you must create the external catalog using the prefix you specified in fallback_to_hadoop_fs_list.
  - Supported Version(s): v3.1.9+, v3.2.4+
- S3-compatible object storage services other than those listed above
  - Description: You need to specify the object storage URI prefix in the BE configuration item s3_compatible_fs_list. Note that you must create the external catalog using the prefix you specified in s3_compatible_fs_list.
  - Supported Version(s): v3.1.9+, v3.2.4+
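Version entries such as "v3.1.9+, v3.2.4+" appear throughout this document. The sketch below shows one way to check a cluster version against such an entry. It assumes the reading that each listed minimum applies to its own minor release line, and that any release line newer than the newest one listed is also covered. The function names are hypothetical and not part of StarRocks.

```python
import re


def parse_version(text):
    """Turn 'v3.1.9' or 'v3.2' into a (major, minor, patch) tuple; missing parts become 0."""
    parts = [int(p) for p in re.findall(r"\d+", text)]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def is_supported(version, spec):
    """Check a version against a spec such as 'v3.1.9+, v3.2.4+'.

    Assumed reading: each minimum applies to its own major.minor line,
    and any line newer than the newest listed one is supported.
    """
    minimums = [parse_version(m) for m in re.findall(r"v\d+(?:\.\d+)*", spec)]
    if not minimums:
        return False
    current = parse_version(version)
    newest_line = max(m[:2] for m in minimums)
    if current[:2] > newest_line:
        return True
    return any(current[:2] == m[:2] and current[2] >= m[2] for m in minimums)


if __name__ == "__main__":
    spec = "v3.1.9+, v3.2.4+"
    for candidate in ("v3.1.8", "v3.1.10", "v3.2.3", "v3.3.0"):
        print(candidate, is_supported(candidate, spec))
```

Under this reading, a release line that is older than the newest listed one and has no entry of its own (for example v3.0.x in "v2.5.19+ v3.1.9+ v3.2.3+") is treated as unsupported.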
Compression formats
This section only lists the compression formats supported by each file format. For the file formats supported by each external catalog, please refer to the section on the corresponding external catalog.
| File Format | Compression Formats | 
|---|---|
| Parquet | NO_COMPRESSION, SNAPPY, LZ4, ZSTD, GZIP, LZO (v3.1.5+) | 
| ORC | NO_COMPRESSION, ZLIB, SNAPPY, LZO, LZ4, ZSTD | 
| Text | NO_COMPRESSION, LZO (v3.1.5+) | 
| Avro | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), BZIP2 (v3.2.1+) | 
| RCFile | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), GZIP (v3.2.1+) | 
| SequenceFile | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), BZIP2 (v3.2.1+), GZIP (v3.2.1+) | 
The Avro, RCFile, and SequenceFile file formats are read through the Java Native Interface (JNI) instead of the native readers within StarRocks. Therefore, the read performance for these file formats may not be as good as that of Parquet and ORC.
Management, credential, and access control
| Feature | Description | Supported Version(s) | 
|---|---|---|
| Information Schema | Supports Information Schema for external catalogs. | v3.2+ | 
| Data lake access control | Supports StarRocks' native RBAC model for external catalogs. You can manage the privileges of databases, tables, and views (currently, Hive views and Iceberg views only) in external catalogs just like those in the default catalog of StarRocks. | v3.0+ | 
| Reuse external services on Apache Ranger | Supports reusing the external service (such as the Hive Service) on Apache Ranger for access control. | v3.1.9+ | 
| Kerberos authentication | Supports Kerberos authentication for HDFS or Hive Metastore. | v2.3+ | 
Data Cache
| Feature | Description | Supported Version(s) | 
|---|---|---|
| Data Cache (Block Cache) | From v2.5 onwards, StarRocks supports the Data Cache feature (then called Block Cache). The initial implementation was based on CacheLib, which limited its extensibility and room for optimization. Starting from v3.0, StarRocks refactored the cache implementation and has added new features to Data Cache, with performance improving in subsequent versions. | v2.5+ | 
| Data rebalancing among local disks | Supports a data rebalancing strategy that keeps data skew among local disks under 10%. | v3.2+ | 
| Replace Block Cache with Data Cache | Parameter changes in BE configurations. | v3.2+ | 
| New metrics for API that monitors Data Cache | Supports a dedicated API for monitoring Data Cache, including cache capacity and hits. You can view Data Cache metrics via the interface http://${BE_HOST}:${BE_HTTP_PORT}/api/datacache/stat. | v3.2.3+ | 
| Memory Tracker for Data Cache | Supports Memory Tracker for Data Cache. You can view the memory-related metrics via the interface http://${BE_HOST}:${BE_HTTP_PORT}/mem_tracker. | v3.1.8+ | 
| Data Cache Warmup | By executing CACHE SELECT, you can proactively populate the cache with the desired data from remote storage in advance to prevent the first query from taking too much time fetching the data. CACHE SELECT will not print data or incur calculations. It only fetches data. | v3.3+ | 
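The two monitoring interfaces listed in the table can be queried with any HTTP client. The following is a minimal sketch that builds the URLs from the templates above and fetches the raw response body. The helper names are hypothetical, the port 8040 in the example is an assumption about your BE HTTP port, and the response is returned as plain text because its structure is not described here.

```python
from urllib.request import urlopen


def datacache_stat_url(be_host, be_http_port):
    """Build the Data Cache metrics URL from the template in the table above."""
    return f"http://{be_host}:{be_http_port}/api/datacache/stat"


def mem_tracker_url(be_host, be_http_port):
    """Build the Memory Tracker URL from the template in the table above."""
    return f"http://{be_host}:{be_http_port}/mem_tracker"


def fetch_text(url, timeout=5.0):
    """Fetch a URL and return the body as text; the body is not parsed."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 127.0.0.1 and 8040 are placeholders; substitute your own BE host and HTTP port.
    print(datacache_stat_url("127.0.0.1", 8040))
    print(mem_tracker_url("127.0.0.1", 8040))
```

Calling fetch_text(datacache_stat_url(host, port)) against a running BE returns whatever that node reports, so you would typically repeat the call for each BE.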
Hive Catalog
Metadata
Hive Catalog's support for Hive Metastore (HMS) and AWS Glue mostly overlaps except that the automatic incremental update feature for HMS is not recommended. The default configuration is recommended in most cases.
The performance of metadata retrieval largely depends on the performance of the user's HMS or HDFS NameNode. Please consider all factors and base your judgment on test results.
- [Default and Recommended] Best performance with a tolerance of minute-level data inconsistency
  - Configuration: You can use the default setting. Data updated within 10 minutes (by default) is not visible. Old data will be returned to queries within this duration.
  - Advantage: Best query performance.
  - Disadvantage: Data inconsistency caused by latency.
  - Supported Version(s): v2.5.5+ (Disabled by default in v2.5 and enabled by default in v3.0+)
- Instant visibility of newly loaded data (files) without manual refresh
  - Configuration: Disable the cache for the metadata of the underlying data files by setting the catalog property enable_remote_file_cache to false.
  - Advantage: Visibility of file changes with no delay.
  - Disadvantage: Lower performance when the file metadata cache is disabled. Each query must access the file list.
  - Supported Version(s): v2.5.5+
- Instant visibility of partition changes without manual refresh
  - Configuration: Disable the cache for the Hive partition names by setting the catalog property enable_cache_list_names to false.
  - Advantage: Visibility of partition changes with no delay.
  - Disadvantage: Lower performance when the partition name cache is disabled. Each query must access the partition list.
  - Supported Version(s): v2.5.5+
If you demand real-time updates on the data changes whilst the performance of your HMS is not optimized, you can enable the cache, disable the automatic incremental update, and manually refresh the metadata (using REFRESH EXTERNAL TABLE) via a scheduling system whenever there is a data change upstream.
Storage system
| Feature | Description | Supported Version(s) | 
|---|---|---|
| Recursive sub-directory listing | Enable recursive sub-directory listing by setting the Catalog property enable_recursive_listing to true. When recursive listing is enabled, StarRocks will read data from a table and its partitions and from the subdirectories within the physical locations of the table and its partitions. This feature is designed to address the issue of multi-layer nested directories. | v2.5.9+, v3.0.4+ (Disabled by default in v2.5 and v3.0, and enabled by default in v3.1+) | 
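To make the effect of recursive listing concrete, the sketch below contrasts a flat listing with a recursive one on a local directory tree. It only illustrates the concept with the Python standard library; it is not how StarRocks lists files, and the function name is hypothetical.

```python
import os


def list_data_files(location, recursive):
    """Return sorted paths of files under location, relative to location.

    With recursive=False only files directly inside location are returned;
    with recursive=True files in nested subdirectories are included too.
    """
    if not recursive:
        return sorted(
            name
            for name in os.listdir(location)
            if os.path.isfile(os.path.join(location, name))
        )
    found = []
    for root, _dirs, files in os.walk(location):
        for name in files:
            found.append(os.path.relpath(os.path.join(root, name), location))
    return sorted(found)
```

With a partition directory that contains a file and a nested subdirectory holding more files, the flat call returns only the top-level file, while the recursive call also returns the nested ones.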
File formats and data types
File formats
| Feature | Supported File Formats | 
|---|---|
| Read | Parquet, ORC, TEXT, Avro, RCFile, SequenceFile | 
| Sink | Parquet (v3.2+), ORC (v3.3+), TEXT (v3.3+) | 
Data types
The INTERVAL, BINARY, and UNION types are not supported.
TEXT-formatted Hive tables do not support the MAP and STRUCT types.
Hive views
StarRocks supports querying Hive views from v3.1.0 onwards.
Query statistics interfaces
| Feature | Supported Version(s) | 
|---|---|
| Supports SHOW CREATE TABLE to view Hive table schema | v3.0+ | 
| Supports ANALYZE to collect statistics | v3.2+ | 
| Supports collecting histograms and STRUCT subfield statistics | v3.3+ | 
Data sinking
| Feature | Supported Version(s) | Note | 
|---|---|---|
| CREATE DATABASE | v3.2+ | Specifying the location for a database created in Hive is optional. If you do not specify the location for the database, you must specify the location for each table created under it. Otherwise, an error is returned. If you have specified the location for the database, tables without a location inherit the location of the database. If you have specified locations for both the database and the table, the table's location takes precedence. | 
| CREATE TABLE | v3.2+ | For both partitioned and non-partitioned tables. | 
| CREATE TABLE AS SELECT | v3.2+ | |
| INSERT INTO/OVERWRITE | v3.2+ | For both partitioned and non-partitioned tables. | 
| CREATE TABLE LIKE | v3.2.4+ | |
| Sink file size | v3.3+ | You can define the maximum size of each data file to be sunk using the session variable connector_sink_target_max_file_size. | 
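The location rules described for CREATE DATABASE can be summarized as a small decision function. The sketch below is only a model of those rules: the function name and the example paths are hypothetical, and it reports which setting supplies the location rather than the exact directory StarRocks creates for an inheriting table, which is not specified here.

```python
def resolve_table_location(database_location=None, table_location=None):
    """Return (source, location) following the rules in the table above.

    - A location set on the table always wins.
    - Otherwise the table inherits from the database location.
    - If neither is set, table creation fails.
    """
    if table_location:
        return ("table", table_location)
    if database_location:
        return ("database", database_location)
    raise ValueError("no location specified for the database or the table")


if __name__ == "__main__":
    # Example paths are placeholders.
    print(resolve_table_location("s3://bucket/db", "s3://bucket/custom/tbl"))
    print(resolve_table_location("s3://bucket/db", None))
```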
Iceberg Catalog
Metadata
Iceberg Catalog supports HMS, Glue, and Tabular as its metastore. The default configuration is recommended in most cases.
Please note that the default value of the session variable enable_iceberg_metadata_cache has been changed to accommodate different scenarios:
- From v3.2.1 to v3.2.3, this parameter is set to true by default, regardless of what metastore service is used.
- In v3.2.4 and later, if the Iceberg cluster uses AWS Glue as metastore, this parameter still defaults to true. However, if the Iceberg cluster uses other metastore services such as Hive metastore, this parameter defaults to false.
- From v3.3.0 onwards, the default value of this parameter is set to true again because StarRocks supports the new Iceberg metadata framework. Iceberg Catalog and Hive Catalog now use the same metadata polling mechanism and FE configuration item background_refresh_metadata_interval_millis.
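The version-dependent default of enable_iceberg_metadata_cache can be written as a lookup, which may help when auditing clusters on mixed versions. This is a sketch of the three rules above only: the function name and the metastore label "glue" are hypothetical, and versions earlier than v3.2.1 return None because no default is stated for them here.

```python
def default_iceberg_metadata_cache(version, metastore):
    """Return the documented default of enable_iceberg_metadata_cache.

    version is a string such as 'v3.2.4'; metastore is a label such as
    'glue' or 'hive' (labels are assumptions made for this sketch).
    """
    parts = [int(p) for p in version.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))
    current = tuple(parts[:3])

    if current >= (3, 3, 0):
        return True  # default is true again from v3.3.0
    if current >= (3, 2, 4):
        return metastore.lower() == "glue"  # true only with AWS Glue
    if current >= (3, 2, 1):
        return True  # true regardless of metastore
    return None  # not covered by the rules above


if __name__ == "__main__":
    for ver, ms in (("v3.2.2", "hive"), ("v3.2.5", "hive"), ("v3.2.5", "glue"), ("v3.3.0", "hive")):
        print(ver, ms, default_iceberg_metadata_cache(ver, ms))
```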
| Feature | Supported Version(s) | 
|---|---|
| Distributed metadata plan (Recommended for scenarios with a large volume of metadata) | v3.3+ | 
| Manifest Cache (Recommended for scenarios with a small volume of metadata but high demand on latency) | v3.3+ | 
From v3.3.0 onwards, StarRocks supports the metadata reading and caching policies described above. The system automatically chooses a policy according to the machines in your cluster, so you usually do not need to change it. Because metadata caching is enabled, metadata freshness may be traded for performance. You can adjust this trade-off according to your specific query requirements:
- [Default and recommended] Optimal performance with tolerance of minute-level data inconsistencies
  - Setting: No additional setting is required. By default, data updated within 10 minutes is not visible. During this time, queries will return old data.
  - Advantage: Best query performance.
  - Disadvantage: Data inconsistency caused by delays.
- New data files generated by import and partition additions or deletions are immediately visible, and no manual refresh is required
  - Setting: Set the Catalog property iceberg_table_cache_ttl_sec to 0 to allow StarRocks to fetch a new snapshot for each query.
  - Advantage: File and partition changes are visible without delay.
  - Disadvantage: Lower performance due to the snapshot fetching behavior for each query.
File formats
| Feature | Supported File Formats | 
|---|---|
| Read | Parquet, ORC | 
| Sink | Parquet | 
- Both Parquet-formatted and ORC-formatted Iceberg V1 tables support position deletes and equality deletes.
- ORC-formatted Iceberg V2 tables support position deletes from v3.0.0, and Parquet-formatted ones support position deletes from v3.1.0.
- ORC-formatted Iceberg V2 tables support equality deletes from v3.1.8 and v3.2.3, and Parquet-formatted ones support equality deletes from v3.2.5.
Iceberg views
StarRocks supports querying Iceberg views from v3.3.2 onwards. Currently, only Iceberg views created through StarRocks are supported.
Query statistics interfaces
| Feature | Supported Version(s) | 
|---|---|
| Supports SHOW CREATE TABLE to view Iceberg table schema | v3.0+ | 
| Supports ANALYZE to collect statistics | v3.2+ | 
| Supports collecting histograms and STRUCT subfield statistics | v3.3+ | 
Data sinking
| Feature | Supported Version(s) | Note | 
|---|---|---|
| CREATE DATABASE | v3.1+ | Specifying the location for a database created in Iceberg is optional. If you do not specify the location for the database, you must specify the location for each table created under it. Otherwise, an error is returned. If you have specified the location for the database, tables without a location inherit the location of the database. If you have specified locations for both the database and the table, the table's location takes precedence. | 
| CREATE TABLE | v3.1+ | For both partitioned and non-partitioned tables. | 
| CREATE TABLE AS SELECT | v3.1+ | |
| INSERT INTO/OVERWRITE | v3.1+ | For both partitioned and non-partitioned tables. | 
Miscellaneous supports
| Feature | Supported Version(s) | 
|---|---|
| Supports reading TIMESTAMP-type partition formats yyyy-MM-ddTHH:mm and yyyy-MM-dd HH:mm. | v2.5.19+, v3.1.9+, v3.2.3+ | 
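The two partition formats differ only in the separator between the date and the time. As an illustration, the sketch below parses both with the Python standard library, using strptime patterns that correspond to yyyy-MM-ddTHH:mm and yyyy-MM-dd HH:mm. The function name is hypothetical, and this is not the parser StarRocks uses.

```python
from datetime import datetime

# strptime equivalents of yyyy-MM-ddTHH:mm and yyyy-MM-dd HH:mm
PARTITION_FORMATS = ("%Y-%m-%dT%H:%M", "%Y-%m-%d %H:%M")


def parse_partition_timestamp(value):
    """Parse a partition value written in either of the two formats."""
    for fmt in PARTITION_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported partition timestamp format: {value!r}")


if __name__ == "__main__":
    print(parse_partition_timestamp("2024-05-01T13:45"))
    print(parse_partition_timestamp("2024-05-01 13:45"))
```

Both calls yield the same datetime value, which is why the two spellings can identify the same partition.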
Hudi Catalog
- StarRocks supports querying the Parquet-formatted data in Hudi, and supports SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION compression formats for Parquet files.
- StarRocks fully supports Hudi's Copy On Write (COW) tables and Merge On Read (MOR) tables.
- StarRocks supports SHOW CREATE TABLE to view Hudi table schema from v3.0.0 onwards.
Delta Lake Catalog
- StarRocks supports querying the Parquet-formatted data in Delta Lake, and supports SNAPPY, LZ4, ZSTD, GZIP, and NO_COMPRESSION compression formats for Parquet files.
- StarRocks does not support querying the MAP-type and STRUCT-type data in Delta Lake.
- StarRocks supports SHOW CREATE TABLE to view Delta Lake table schema from v3.0.0 onwards.
- Currently, Delta Lake catalogs support the following table features:
  - V2 Checkpoint (from v3.3.0 onwards)
  - Timestamp without Timezone (from v3.3.1 onwards)
  - Column mapping (from v3.3.6 onwards)
  - Deletion Vector (from v3.4.1 onwards)
JDBC Catalog
| Catalog type | Supported Version(s) | 
|---|---|
| MySQL | v3.0+ | 
| PostgreSQL | v3.0+ | 
| ClickHouse | v3.3+ | 
| Oracle | v3.2.9+ | 
| SQL Server | v3.2.9+ | 
MySQL
| Feature | Supported Version(s) | 
|---|---|
| Metadata cache | v3.3+ | 
Data type correspondence
| MySQL | StarRocks | Supported Version(s) | 
|---|---|---|
| BOOLEAN | BOOLEAN | v2.3+ | 
| BIT | BOOLEAN | v2.3+ | 
| SIGNED TINYINT | TINYINT | v2.3+ | 
| UNSIGNED TINYINT | SMALLINT | v3.0.6+ v3.1.2+ | 
| SIGNED SMALLINT | SMALLINT | v2.3+ | 
| UNSIGNED SMALLINT | INT | v3.0.6+ v3.1.2+ | 
| SIGNED INTEGER | INT | v2.3+ | 
| UNSIGNED INTEGER | BIGINT | v3.0.6+ v3.1.2+ | 
| SIGNED BIGINT | BIGINT | v2.3+ | 
| UNSIGNED BIGINT | LARGEINT | v3.0.6+ v3.1.2+ | 
| FLOAT | FLOAT | v2.3+ | 
| REAL | FLOAT | v3.0.1+ | 
| DOUBLE | DOUBLE | v2.3+ | 
| DECIMAL | DECIMAL32 | v2.3+ | 
| CHAR | VARCHAR(columnsize) | v2.3+ | 
| VARCHAR | VARCHAR | v2.3+ | 
| TEXT | VARCHAR(columnsize) | v3.0.1+ | 
| DATE | DATE | v2.3+ | 
| TIME | TIME | v3.1.9+ v3.2.4+ | 
| TIMESTAMP | DATETIME | v2.3+ | 
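In the table above, each unsigned MySQL integer type maps to the next wider StarRocks integer type. A plausible reading is that the signed type of the same width cannot hold the upper half of the unsigned range. The sketch below checks that arithmetic, using the standard bit widths of these types (TINYINT 8, SMALLINT 16, INT 32, BIGINT 64, LARGEINT 128). The helper names are hypothetical.

```python
# Bit widths of the signed StarRocks integer types.
SIGNED_BITS = {"TINYINT": 8, "SMALLINT": 16, "INT": 32, "BIGINT": 64, "LARGEINT": 128}

# MySQL unsigned type -> (its bit width, StarRocks type from the table above).
UNSIGNED_MAPPING = {
    "UNSIGNED TINYINT": (8, "SMALLINT"),
    "UNSIGNED SMALLINT": (16, "INT"),
    "UNSIGNED INTEGER": (32, "BIGINT"),
    "UNSIGNED BIGINT": (64, "LARGEINT"),
}


def unsigned_max(bits):
    """Largest value of an unsigned integer with the given width."""
    return 2 ** bits - 1


def signed_max(bits):
    """Largest value of a signed two's-complement integer with the given width."""
    return 2 ** (bits - 1) - 1


def fits_in_target(mysql_type):
    """True if every value of the unsigned type fits in its mapped StarRocks type."""
    bits, target = UNSIGNED_MAPPING[mysql_type]
    return unsigned_max(bits) <= signed_max(SIGNED_BITS[target])


if __name__ == "__main__":
    for name, (bits, target) in UNSIGNED_MAPPING.items():
        same_width_ok = unsigned_max(bits) <= signed_max(bits)
        print(name, "->", target, "fits:", fits_in_target(name), "same width fits:", same_width_ok)
```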
PostgreSQL
Data type correspondence
| PostgreSQL | StarRocks | Supported Version(s) | 
|---|---|---|
| BIT | BOOLEAN | v2.3+ | 
| SMALLINT | SMALLINT | v2.3+ | 
| INTEGER | INT | v2.3+ | 
| BIGINT | BIGINT | v2.3+ | 
| REAL | FLOAT | v2.3+ | 
| DOUBLE | DOUBLE | v2.3+ | 
| NUMERIC | DECIMAL32 | v2.3+ | 
| CHAR | VARCHAR(columnsize) | v2.3+ | 
| VARCHAR | VARCHAR | v2.3+ | 
| TEXT | VARCHAR(columnsize) | v2.3+ | 
| DATE | DATE | v2.3+ | 
| TIMESTAMP | DATETIME | v2.3+ | 
ClickHouse
Supported from v3.3.0 onwards.
Oracle
Supported from v3.2.9 onwards.
SQL Server
Supported from v3.2.9 onwards.
Elasticsearch Catalog
Elasticsearch Catalog is supported from v3.1.0 onwards.
Paimon Catalog
Paimon Catalog is supported from v3.1.0 onwards.
MaxCompute Catalog
MaxCompute Catalog is supported from v3.3.0 onwards.
Kudu Catalog
Kudu Catalog is supported from v3.3.0 onwards.