Load data using Kafka connector
StarRocks provides a self-developed connector, the StarRocks Connector for Apache Kafka® (Kafka connector for short). As a sink connector, it continuously consumes messages from Kafka and loads them into StarRocks. The Kafka connector guarantees at-least-once semantics.
The Kafka connector integrates seamlessly with Kafka Connect, which allows StarRocks to connect better with the Kafka ecosystem. It is a good choice if you want to load real-time data into StarRocks. Compared with Routine Load, the Kafka connector is recommended in the following scenarios:
- Unlike Routine Load, which only supports loading data in CSV, JSON, and Avro formats, the Kafka connector can load data in more formats, such as Protobuf. As long as data can be converted into JSON or CSV format using Kafka Connect's converters, it can be loaded into StarRocks via the Kafka connector (see the converter sketch after this list).
- You need to customize data transformation, such as for Debezium-formatted CDC data.
- You need to load data from multiple Kafka topics.
- You need to load data from Confluent Cloud.
- You need finer control over load batch sizes, parallelism, and other parameters to strike a balance between load speed and resource utilization.
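For the format-conversion scenario above, converters are set in the Kafka Connect worker configuration. Below is a minimal sketch, assuming a standalone worker whose configuration file is config/connect-standalone.properties (the path is an assumption; adjust it to your deployment):

```bash
# Sketch: converter settings for a Kafka Connect worker.
# The config file path below is an assumption; use your own worker config.
cat >> config/connect-standalone.properties <<'EOF'
# Deserialize message keys and values as schemaless JSON
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
EOF
```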
Preparations
Version requirements
| Connector | Kafka | StarRocks | Java |
|---|---|---|---|
| 1.0.4 | 3.4 | 2.5 and later | 8 |
| 1.0.3 | 3.4 | 2.5 and later | 8 |
Set up Kafka environment
Both self-managed Apache Kafka clusters and Confluent Cloud are supported.
- For a self-managed Apache Kafka cluster, you can refer to the Apache Kafka quickstart to deploy a Kafka cluster quickly (see the sketch after this list). Kafka Connect is already integrated into Kafka.
- For Confluent Cloud, make sure that you have a Confluent account and have created a cluster.
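For reference, below is a minimal sketch of the self-managed path based on the Apache Kafka quickstart, assuming the Kafka 3.4.0 release for Scala 2.13 running a single node in KRaft mode (the archive name varies with the release you download):

```bash
# Extract the downloaded Kafka release (archive name is an assumption)
tar -xzf kafka_2.13-3.4.0.tgz
cd kafka_2.13-3.4.0

# Format storage and start a single-node broker in KRaft mode
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties
```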
Download Kafka connector
Install the Kafka connector into Kafka Connect:
- Self-managed Kafka cluster:

  Download and extract starrocks-kafka-connector-xxx.tar.gz (see the installation sketch after this list).

- Confluent Cloud:

  Currently, the Kafka connector is not uploaded to Confluent Hub. You need to download and extract starrocks-kafka-connector-xxx.tar.gz, package it into a ZIP file, and upload the ZIP file to Confluent Cloud.
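For the self-managed case, the extracted connector must sit in a directory listed in the worker's plugin.path. A minimal sketch, assuming a hypothetical plugin directory /opt/kafka-connect/plugins (keep xxx as the placeholder for the actual connector version):

```bash
# Directory is an assumption; substitute your own plugin location.
mkdir -p /opt/kafka-connect/plugins
# Replace xxx with the connector version you downloaded.
tar -xzvf starrocks-kafka-connector-xxx.tar.gz -C /opt/kafka-connect/plugins

# Point the Kafka Connect worker at the plugin directory by adding
# the following line to its worker configuration file:
#   plugin.path=/opt/kafka-connect/plugins
```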
Network configuration
Ensure that the machine where Kafka is located can access the FE nodes of the StarRocks cluster via the http_port (default: 8030) and query_port (default: 9030), and the BE nodes via the be_http_port (default: 8040).
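To verify reachability before loading, a quick sketch using netcat, assuming placeholder hostnames <fe_host> and <be_host>:

```bash
# Replace <fe_host> and <be_host> with your actual node addresses.
nc -zv <fe_host> 8030   # FE http_port
nc -zv <fe_host> 9030   # FE query_port
nc -zv <be_host> 8040   # BE be_http_port
```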
Usage
This section uses a self-managed Kafka cluster as an example to explain how to configure the Kafka connector and Kafka Connect, and then run Kafka Connect to load data into StarRocks.
Prepare a dataset
Suppose that the following JSON-format data exists in the topic `test` in the Kafka cluster.
{"id":1,"city":"New York"}
{"id":2,"city":"Los Angeles"}
{"id":3,"city":"Chicago"}