Feature Support: Data Lake Analytics
From v2.3 onwards, StarRocks supports managing external data sources and analyzing data in data lakes via external catalogs.
This document outlines the features supported by external catalogs and the versions in which each feature is supported.
Universal features
This section lists the features common to all external catalogs, including storage systems, file readers, credentials, privileges, and Data Cache.
External storage systems
Storage System | Supported Version |
---|---|
HDFS | v2.3+ |
AWS S3 | v2.3+ |
Microsoft Azure Storage | v3.0+ |
Google GCS | v3.0+ |
Alibaba Cloud OSS | v3.1+ |
Huawei Cloud OBS | v3.1+ |
Tencent Cloud COS | v3.1+ |
Volcengine TOS | v3.1+ |
Kingsoft Cloud KS3 | v3.1+ |
MinIO | v3.1+ |
Ceph S3 | v3.1+ |
In addition to the native support for the storage systems listed above, StarRocks also supports the following types of object storage services:
- HDFS-compatible object storage services such as COS Cloud HDFS, OSS-HDFS, and OBS PFS
  - Description: You need to specify the object storage URI prefix in the BE configuration item fallback_to_hadoop_fs_list, and upload the .jar package provided by the cloud vendor to the directory /lib/hadoop/hdfs/. Note that you must create the external catalog using the prefix you specified in fallback_to_hadoop_fs_list.
  - Supported Version(s): v3.1.9+, v3.2.4+
- S3-compatible object storage services other than those listed above
  - Description: You need to specify the object storage URI prefix in the BE configuration item s3_compatible_fs_list. Note that you must create the external catalog using the prefix you specified in s3_compatible_fs_list.
  - Supported Version(s): v3.1.9+, v3.2.4+
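Because the catalog must be created with a prefix that is registered in the BE configuration item, it can help to check a location before creating the catalog. The following is a minimal sketch of such a pre-flight check that you could run yourself. It is not how StarRocks matches prefixes internally. The comma-separated value format, the prefix mys3://, and the bucket paths are assumptions made for illustration.

```python
def parse_fs_list(value: str) -> list:
    """Split a comma-separated prefix list such as 's3a://, mys3://'.

    The comma-separated format is an assumption for this sketch.
    """
    return [prefix.strip() for prefix in value.split(",") if prefix.strip()]


def uses_registered_prefix(location: str, fs_list: str) -> bool:
    """Return True if the catalog location starts with a registered prefix."""
    return any(location.startswith(prefix) for prefix in parse_fs_list(fs_list))


# Hypothetical value of s3_compatible_fs_list and hypothetical locations.
configured = "s3a://, mys3://"
print(uses_registered_prefix("mys3://bucket/warehouse", configured))   # True
print(uses_registered_prefix("other://bucket/warehouse", configured))  # False
```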
Compression formats
This section only lists the compression formats supported by each file format. For the file formats supported by each external catalog, please refer to the section on the corresponding external catalog.
File Format | Compression Formats |
---|---|
Parquet | NO_COMPRESSION, SNAPPY, LZ4, ZSTD, GZIP, LZO (v3.1.5+) |
ORC | NO_COMPRESSION, ZLIB, SNAPPY, LZO, LZ4, ZSTD |
Text | NO_COMPRESSION, LZO (v3.1.5+) |
Avro | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), BZIP2 (v3.2.1+) |
RCFile | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), GZIP (v3.2.1+) |
SequenceFile | NO_COMPRESSION (v3.2.1+), DEFLATE (v3.2.1+), SNAPPY (v3.2.1+), BZIP2 (v3.2.1+), GZIP (v3.2.1+) |
The Avro, RCFile, and SequenceFile file formats are read through the Java Native Interface (JNI) instead of the native readers within StarRocks. Therefore, the read performance for these file formats may not be as good as that of Parquet and ORC.
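The compression table above can be encoded as a small lookup for scripted checks. The sketch below copies the entries from the table. The function name is_supported and the tuple representation of versions are assumptions for illustration. Comparing version tuples is a simplification, so treat the result as a first check only.

```python
# Minimum StarRocks version per (file format, compression format), copied from
# the table above. None means the table lists no minimum version for the entry.
V315 = (3, 1, 5)
V321 = (3, 2, 1)
SUPPORT = {
    "Parquet": {**dict.fromkeys(["NO_COMPRESSION", "SNAPPY", "LZ4", "ZSTD", "GZIP"]), "LZO": V315},
    "ORC": dict.fromkeys(["NO_COMPRESSION", "ZLIB", "SNAPPY", "LZO", "LZ4", "ZSTD"]),
    "Text": {"NO_COMPRESSION": None, "LZO": V315},
    "Avro": dict.fromkeys(["NO_COMPRESSION", "DEFLATE", "SNAPPY", "BZIP2"], V321),
    "RCFile": dict.fromkeys(["NO_COMPRESSION", "DEFLATE", "SNAPPY", "GZIP"], V321),
    "SequenceFile": dict.fromkeys(["NO_COMPRESSION", "DEFLATE", "SNAPPY", "BZIP2", "GZIP"], V321),
}

# Formats read through JNI rather than the native readers.
JNI_FORMATS = {"Avro", "RCFile", "SequenceFile"}


def is_supported(file_format: str, compression: str, version: tuple) -> bool:
    """Return True if the table lists the combination for the given version."""
    codecs = SUPPORT.get(file_format)
    if codecs is None or compression not in codecs:
        return False
    minimum = codecs[compression]
    return minimum is None or version >= minimum


print(is_supported("Parquet", "LZO", (3, 1, 4)))  # False
print(is_supported("Parquet", "LZO", (3, 1, 5)))  # True
print(is_supported("Avro", "GZIP", (3, 3, 0)))    # False
print("Avro" in JNI_FORMATS)                      # True
```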
Management, credential, and access control
Feature | Description | Supported Version(s) |
---|---|---|
Information Schema | Supports Information Schema for external catalogs. | v3.2+ |
Data lake access control | Supports StarRocks' native RBAC model for external catalogs. You can manage the privileges of databases, tables, and views (currently, Hive views and Iceberg views only) in external catalogs just like those in the default catalog of StarRocks. | v3.0+ |
Reuse external services on Apache Ranger | Supports reusing the external service (such as the Hive Service) on Apache Ranger for access control. | v3.1.9+ |
Kerberos authentication | Supports Kerberos authentication for HDFS or Hive Metastore. | v2.3+ |
Data Cache
Feature | Description | Supported Version(s) |
---|---|---|
Data Cache (Block Cache) | From v2.5 onwards, StarRocks supports the Data Cache feature (then called Block Cache). The initial implementation was based on CacheLib, which limited how far it could be extended and optimized. Starting from v3.0, StarRocks refactored the cache implementation and added new features to Data Cache, with performance improving in subsequent versions. | v2.5+ |
Data rebalancing among local disks | Supports a data rebalancing strategy that keeps data skew among local disks under 10%. | v3.2+ |
Replace Block Cache with Data Cache | Parameter changes in BE configurations. | v3.2+ |
New metrics for API that monitors Data Cache | Supports a dedicated API that monitors Data Cache, including cache capacity and hits. You can view Data Cache metrics via the interface http://${BE_HOST}:${BE_HTTP_PORT}/api/datacache/stat. | v3.2.3+ |
Memory Tracker for Data Cache | Supports Memory Tracker for Data Cache. You can view the memory-related metrics via the interface http://${BE_HOST}:${BE_HTTP_PORT}/mem_tracker. | v3.1.8+ |
Data Cache Warmup | By executing CACHE SELECT, you can proactively populate the cache with the desired data from remote storage in advance to prevent the first query from taking too much time fetching the data. CACHE SELECT will not print data or incur calculations. It only fetches data. | v3.3+ |
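The two monitoring interfaces in the table can be polled from a script. The following is a minimal sketch that builds both URLs and fetches the Data Cache metrics. The host name be1.example.com and port 8040 are placeholders, and the sketch assumes that the stat interface returns a JSON body. Adjust both to your deployment.

```python
import json
import urllib.request


def datacache_stat_url(be_host: str, be_http_port: int) -> str:
    """Build the Data Cache metrics URL for one BE node."""
    return f"http://{be_host}:{be_http_port}/api/datacache/stat"


def mem_tracker_url(be_host: str, be_http_port: int) -> str:
    """Build the Memory Tracker URL for one BE node."""
    return f"http://{be_host}:{be_http_port}/mem_tracker"


def fetch_datacache_stat(be_host: str, be_http_port: int, timeout: float = 5.0) -> dict:
    """Fetch Data Cache metrics. Assumes the response body is JSON."""
    url = datacache_stat_url(be_host, be_http_port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder host and port; no request is sent here.
print(datacache_stat_url("be1.example.com", 8040))
print(mem_tracker_url("be1.example.com", 8040))
```

Call fetch_datacache_stat against a reachable BE node to retrieve the metrics. The field names in the response are not listed in this document, so inspect the returned dictionary before relying on specific keys.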
Hive Catalog
Metadata
Hive Catalog's support for Hive Metastore (HMS) and AWS Glue mostly overlaps except that the automatic incremental update feature for HMS is not recommended. The default configuration is recommended in most cases.
The performance of metadata retrieval largely depends on the performance of the user's HMS or HDFS NameNode. Please consider all factors and base your judgment on test results.
- [Default and Recommended] Best performance with a tolerance of minute-level data inconsistency
  - Configuration: You can use the default setting. Data updated within 10 minutes (by default) is not visible. Old data will be returned to queries within this duration.
  - Advantage: Best query performance.
  - Disadvantage: Data inconsistency caused by latency.
  - Supported Version(s): v2.5.5+ (Disabled by default in v2.5 and enabled by default in v3.0+)
- Instant visibility of newly loaded data (files) without manual refresh
  - Configuration: Disable the cache for the metadata of the underlying data files by setting the catalog property enable_remote_file_cache to false.
  - Advantage: Visibility of file changes with no delay.
  - Disadvantage: Lower performance when the file metadata cache is disabled. Each query must access the file list.
  - Supported Version(s): v2.5.5+
- Instant visibility of partition changes without manual refresh
  - Configuration: Disable the cache for the Hive partition names by setting the catalog property enable_cache_list_names to false.
  - Advantage: Visibility of partition changes with no delay.
  - Disadvantage: Lower performance when the partition name cache is disabled. Each query must access the partition list.
  - Supported Version(s): v2.5.5+
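The options above differ only in the catalog properties they set. The sketch below renders a CREATE EXTERNAL CATALOG statement with one of those properties added. The helper name, the catalog name hive_fresh, and the metastore URI are hypothetical, and the base properties (type and hive.metastore.uris) reflect a typical Hive catalog definition that you should verify against your version.

```python
from typing import Dict, Optional

# Property presets taken from the options described above.
FRESH_FILES = {"enable_remote_file_cache": "false"}
FRESH_PARTITIONS = {"enable_cache_list_names": "false"}


def create_hive_catalog_sql(
    name: str,
    metastore_uri: str,
    extra_properties: Optional[Dict[str, str]] = None,
) -> str:
    """Render a CREATE EXTERNAL CATALOG statement (names are not escaped)."""
    properties = {"type": "hive", "hive.metastore.uris": metastore_uri}
    properties.update(extra_properties or {})
    body = ",\n".join(f'    "{key}" = "{value}"' for key, value in properties.items())
    return f"CREATE EXTERNAL CATALOG {name}\nPROPERTIES (\n{body}\n);"


# Hypothetical catalog name and metastore address.
print(create_hive_catalog_sql("hive_fresh", "thrift://hms.example.com:9083", FRESH_FILES))
```

Passing no extra properties yields the default configuration, which keeps both caches enabled.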
If you demand real-time updates on the data changes whilst the performance of your HMS is not optimized, you can enable the cache, disable the automatic incremental update, and manually refresh the metadata (using REFRESH EXTERNAL TABLE) via a scheduling system whenever there is a data change upstream.
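A scheduling system that refreshes metadata after each upstream change only needs to issue REFRESH EXTERNAL TABLE statements. The following sketch generates them. The helper name and the catalog, database, table, and partition names are hypothetical, and the PARTITION clause follows the commonly documented form of the statement, so verify it against your version. Identifiers are not escaped and must come from a trusted source.

```python
from typing import Optional, Sequence


def refresh_external_table_sql(
    catalog: str,
    database: str,
    table: str,
    partitions: Optional[Sequence[str]] = None,
) -> str:
    """Render a REFRESH EXTERNAL TABLE statement, optionally per partition."""
    target = f"{catalog}.{database}.{table}"
    if partitions:
        quoted = ", ".join(f"'{partition}'" for partition in partitions)
        return f"REFRESH EXTERNAL TABLE {target} PARTITION ({quoted});"
    return f"REFRESH EXTERNAL TABLE {target};"


# Hypothetical names: refresh the whole table, then only one partition.
print(refresh_external_table_sql("hive_catalog", "sales", "orders"))
print(refresh_external_table_sql("hive_catalog", "sales", "orders", ["dt=2024-01-01"]))
```

Refreshing only the partitions that changed keeps the load on HMS lower than refreshing the whole table after every upstream job.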
Storage system
Feature | Description | Supported Version(s) |
---|---|---|
Recursive sub-directory listing | Enable recursive sub-directory listing by setting the catalog property enable_recursive_listing to true. When recursive listing is enabled, StarRocks reads data from a table and its partitions as well as from the subdirectories within the physical locations of the table and its partitions. This feature is designed to address the issue of multi-layer nested directories. | v2.5.9+, v3.0.4+ (Disabled by default in v2.5 and v3.0, and enabled by default in v3.1+) |
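The effect of recursive listing can be illustrated with a local directory that mimics a partition containing nested subdirectories. This is an analogy on the local file system, not the StarRocks implementation. The directory and file names are made up for the example.

```python
import tempfile
from pathlib import Path


def list_data_files(location: Path, recursive: bool) -> list:
    """List files under a location, optionally descending into subdirectories."""
    pattern = "**/*" if recursive else "*"
    return sorted(
        path.relative_to(location).as_posix()
        for path in location.glob(pattern)
        if path.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    # A partition directory with files at three nesting levels.
    partition = Path(tmp) / "dt=2024-01-01"
    (partition / "batch_1" / "part_a").mkdir(parents=True)
    (partition / "000000_0").write_text("top-level file")
    (partition / "batch_1" / "000001_0").write_text("nested file")
    (partition / "batch_1" / "part_a" / "000002_0").write_text("deeply nested file")

    flat = list_data_files(partition, recursive=False)
    nested = list_data_files(partition, recursive=True)

print(flat)    # ['000000_0']
print(nested)  # ['000000_0', 'batch_1/000001_0', 'batch_1/part_a/000002_0']
```

With recursive listing disabled, only the file directly under the partition location is found. With it enabled, the files in the nested directories are found as well.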
File formats and data types
File formats
Feature | Supported File Formats |
---|---|
Read | Parquet, ORC, TEXT, Avro, RCFile, SequenceFile |
Sink | Parquet (v3.2+), ORC (v3.3+), TEXT (v3.3+) |
Data types
INTERVAL, BINARY, and UNION types are not supported.
TEXT-formatted Hive tables do not support MAP and STRUCT types.