
Apache Iceberg Lakehouse

Overview

  • Deploy Object Storage, Apache Spark, an Iceberg catalog, and StarRocks using Docker Compose
  • Load New York City Green Taxi data for the month of May 2023 into the Iceberg data lake
  • Configure StarRocks to access the Iceberg catalog
  • Query the data in place with StarRocks, without moving it out of the data lake

Figure: data lake analytics (DLA), with StarRocks as the compute and analysis layer and the data lake providing storage, organization, and maintenance.

In addition to efficient analytics of local data, StarRocks can work as the compute engine for data stored in data lakes such as Apache Hudi, Apache Iceberg, and Delta Lake. One of the key features of StarRocks is the external catalog, which acts as a link to an externally maintained metastore. External catalogs let users query external data sources in place, with no data migration required. This allows users to analyze data from systems such as HDFS and Amazon S3, in file formats such as Parquet, ORC, and CSV.

The preceding figure shows a data lake analytics scenario in which StarRocks is responsible for computation and analysis, and the data lake is responsible for data storage, organization, and maintenance. Data lakes let users store data in open storage formats with flexible schemas, providing a "single source of truth" for BI, AI, ad-hoc, and reporting use cases. StarRocks fully leverages its vectorized execution engine and cost-based optimizer (CBO), significantly improving the performance of data lake analytics.
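
You will create the external catalog for this environment later in the guide. As a preview, the sketch below shows the general shape of the statement, run through the MySQL client in the starrocks-fe container. The catalog name, REST endpoint, and MinIO credentials here are illustrative assumptions, not the exact values used later.

# Sketch only: declare an Iceberg external catalog backed by a REST metadata
# service and MinIO object storage. Property values are placeholders; use the
# values given in the configuration step of this guide.
docker compose exec starrocks-fe \
  mysql -P 9030 -h 127.0.0.1 -u root -e "
    CREATE EXTERNAL CATALOG iceberg PROPERTIES (
      'type' = 'iceberg',
      'iceberg.catalog.type' = 'rest',
      'iceberg.catalog.uri' = 'http://rest:8181',
      'aws.s3.endpoint' = 'http://minio:9000',
      'aws.s3.access_key' = 'admin',
      'aws.s3.secret_key' = 'password',
      'aws.s3.enable_path_style_access' = 'true'
    );
    SHOW CATALOGS;
  "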

Prerequisites

Docker

  • Docker
  • 5 GB RAM assigned to Docker
  • 20 GB free disk space assigned to Docker
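
If you want to confirm what Docker has been assigned before starting, a quick check along these lines can help (the fields shown are standard docker info output; exact values depend on your Docker settings):

# Show the memory and CPUs available to the Docker engine
docker info --format 'Memory: {{.MemTotal}} bytes  CPUs: {{.NCPU}}'
# Show how much disk space Docker is currently using
docker system df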

SQL client

You can use the SQL client provided in the Docker environment, or one on your system. Many MySQL-compatible clients will work, and this guide covers the configuration of DBeaver and MySQL Workbench.
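
If you choose the client bundled in the Docker environment, you can open a MySQL session inside the starrocks-fe container once the services are up. The port 9030 and the passwordless root user below are StarRocks defaults assumed by this guide; adjust them if your setup differs.

# Open the MySQL client that ships inside the starrocks-fe container
docker compose exec starrocks-fe \
  mysql -P 9030 -h 127.0.0.1 -u root --prompt="StarRocks > "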

curl

curl is used to download the Docker Compose file and the dataset. Check whether it is installed by running curl or curl.exe at your OS prompt. If curl is not installed, download it from the curl website (https://curl.se).
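
For example, this prints the installed version if curl is available:

# Verify that curl is installed and print its version
curl --version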


StarRocks terminology

FE

Frontend (FE) nodes are responsible for metadata management, client connection management, query planning, and query scheduling. Each FE stores and maintains a complete copy of the metadata in memory, so every FE can provide the same service.

BE

Backend (BE) nodes are responsible for both data storage and executing query plans in shared-nothing deployments. When an external catalog (like the Iceberg catalog used in this guide) is used, the BE nodes can cache the data from the external catalog to accelerate queries.
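
Once the environment is up, you can list the FE and BE nodes and confirm that they are alive from any MySQL client. SHOW FRONTENDS and SHOW BACKENDS are standard StarRocks statements; the connection details below assume the defaults used in this guide.

# List the FE and BE nodes along with their roles and status
docker compose exec starrocks-fe \
  mysql -P 9030 -h 127.0.0.1 -u root -e "SHOW FRONTENDS; SHOW BACKENDS;"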


The environment

There are six containers (services) used in this guide, and all are deployed with Docker Compose. The services and their responsibilities are:

Service         Responsibilities
starrocks-fe    Metadata management, client connections, query plans and scheduling
starrocks-be    Running query plans
rest            Providing the Iceberg catalog (metadata service)
spark-iceberg   An Apache Spark environment to run PySpark in
mc              MinIO configuration (MinIO command line client)
minio           MinIO Object Storage
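
After downloading the Docker Compose file in the next step, you can list the service names it defines and compare them with this table:

# Print the service names defined in docker-compose.yml
docker compose config --services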

Download the Docker configuration and NYC Green Taxi data

To provide an environment with all of the necessary containers, StarRocks provides a Docker Compose file. Download the compose file and dataset with curl.

The Docker compose file:

mkdir iceberg
cd iceberg
curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/iceberg/docker-compose.yml

And the dataset:

curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/iceberg/datasets/green_tripdata_2023-05.parquet
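
A quick listing confirms that both files downloaded successfully:

# Confirm the compose file and the Parquet dataset are present
ls -lh docker-compose.yml green_tripdata_2023-05.parquet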

Start the environment in Docker

Tip

Run this command, and any other docker compose commands, from the directory containing the docker-compose.yml file.

docker compose up -d
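
The services can take a short while to start. You can check their status with docker compose ps; when the environment is ready, the containers should be listed as running (and healthy, if the compose file defines health checks).

# List the containers started by the compose file and their status
docker compose ps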