Read data from StarRocks using Spark connector
StarRocks provides a self-developed connector named StarRocks Connector for Apache Spark™ (Spark connector for short) to help you read data from a StarRocks table by using Spark. You can use Spark for complex processing and machine learning on the data you have read from StarRocks.
The Spark connector supports three reading methods: Spark SQL, Spark DataFrame, and Spark RDD.
You can use Spark SQL to create a temporary view on the StarRocks table, and then directly read data from the StarRocks table by using that temporary view.
You can also map the StarRocks table to a Spark DataFrame or a Spark RDD and then read data from the Spark DataFrame or Spark RDD. We recommend using a Spark DataFrame.
NOTICE
Only users with the SELECT privilege on a StarRocks table can read data from this table. You can follow the instructions provided in GRANT to grant the privilege to a user.
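For a quick sense of the first two methods, here is a minimal sketch in Scala for spark-shell. The FE addresses, database `test`, table `score_board`, and credentials are placeholders, and the option names follow the 1.1.x connector's read options; treat them as assumptions and consult the Parameters section below for the authoritative list.

```scala
// Minimal sketch: two ways to read a StarRocks table with the Spark connector.
// Placeholders: FE at 127.0.0.1 (HTTP port 8030, query port 9030), table test.score_board.

// Method 1: Spark SQL -- create a temporary view on the StarRocks table.
spark.sql(
  """CREATE TEMPORARY VIEW spark_starrocks
    |USING starrocks
    |OPTIONS (
    |  "starrocks.table.identifier" = "test.score_board",
    |  "starrocks.fe.http.url" = "127.0.0.1:8030",
    |  "starrocks.fe.jdbc.url" = "jdbc:mysql://127.0.0.1:9030",
    |  "starrocks.user" = "root",
    |  "starrocks.password" = ""
    |)""".stripMargin)
spark.sql("SELECT * FROM spark_starrocks").show()

// Method 2: Spark DataFrame -- map the table to a DataFrame (recommended).
val df = spark.read.format("starrocks")
  .option("starrocks.table.identifier", "test.score_board")
  .option("starrocks.fe.http.url", "127.0.0.1:8030")
  .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030")
  .option("starrocks.user", "root")
  .option("starrocks.password", "")
  .load()
df.show()
```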
Usage notes
- You can filter data on StarRocks before you read it, which reduces the amount of data transferred. See the sketch after this list.
- If the overhead of reading data is substantial, you can use appropriate table design and filter conditions to prevent Spark from reading an excessive amount of data at a time. This reduces I/O pressure on your disks and network connections, thereby ensuring that routine queries can still run properly.
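As an illustration of the first note, the connector's `starrocks.filter.query` parameter (described in the Parameters section below) pushes a predicate down to StarRocks, so only matching rows leave the cluster. A hedged sketch, reusing the placeholder table from the example above:

```scala
// Hedged sketch: push a filter down to StarRocks instead of filtering in Spark.
// The predicate in starrocks.filter.query is evaluated on the StarRocks side,
// so only rows with score > 90 are transferred; the column name is a placeholder.
val filtered = spark.read.format("starrocks")
  .option("starrocks.table.identifier", "test.score_board")
  .option("starrocks.fe.http.url", "127.0.0.1:8030")
  .option("starrocks.fe.jdbc.url", "jdbc:mysql://127.0.0.1:9030")
  .option("starrocks.user", "root")
  .option("starrocks.password", "")
  .option("starrocks.filter.query", "score > 90")
  .load()
filtered.show()
```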
Version requirements
| Spark connector | Spark | StarRocks | Java | Scala |
|---|---|---|---|---|
| 1.1.2 | 3.2, 3.3, 3.4, 3.5 | 2.5 and later | 8 | 2.12 |
| 1.1.1 | 3.2, 3.3, 3.4 | 2.5 and later | 8 | 2.12 |
| 1.1.0 | 3.2, 3.3, 3.4 | 2.5 and later | 8 | 2.12 |
| 1.0.0 | 3.x | 1.18 and later | 8 | 2.12 |
| 1.0.0 | 2.x | 1.18 and later | 8 | 2.11 |
NOTICE
- See Upgrade Spark connector for behavior changes across connector versions.
- Since version 1.1.1, the Spark connector does not provide the MySQL JDBC driver, and you need to import the driver into the Spark classpath manually. You can find the driver on Maven Central.
- In version 1.0.0, the Spark connector only supports reading data from StarRocks. From version 1.1.0 onwards, the Spark connector supports both reading data from and writing data to StarRocks.
- Version 1.0.0 differs from version 1.1.0 in terms of parameters and data type mappings. See Upgrade Spark connector.
- In general, no new features will be added to version 1.0.0. We recommend that you upgrade your Spark connector at your earliest opportunity.
Obtain Spark connector
Use one of the following methods to obtain the Spark connector .jar package that suits your business needs:
- Download a compiled package.
- Use Maven to add the dependencies required by the Spark connector. (This method is supported only for Spark connector 1.1.0 and later.)
- Manually compile a package.
Spark connector 1.1.0 and later
Spark connector .jar packages are named in the following format:
```
starrocks-spark-connector-${spark_version}_${scala_version}-${connector_version}.jar
```

For example, if you want to use Spark connector 1.1.0 with Spark 3.2 and Scala 2.12, you can choose `starrocks-spark-connector-3.2_2.12-1.1.0.jar`.
NOTICE
In normal cases, the latest Spark connector version can be used with the most recent three Spark versions.
Download a compiled package
You can obtain Spark connector .jar packages of various versions at Maven Central Repository.
Add Maven dependencies
Configure the dependencies required by the Spark connector as follows:
NOTICE
You must replace `spark_version`, `scala_version`, and `connector_version` with the Spark version, Scala version, and Spark connector version you use.
```xml
<dependency>
  <groupId>com.starrocks</groupId>
  <artifactId>starrocks-spark-connector-${spark_version}_${scala_version}</artifactId>
  <version>${connector_version}</version>
</dependency>
```
For example, if you want to use Spark connector 1.1.0 with Spark 3.2 and Scala 2.12, configure the dependencies as follows:
```xml
<dependency>
  <groupId>com.starrocks</groupId>
  <artifactId>starrocks-spark-connector-3.2_2.12</artifactId>
  <version>1.1.0</version>
</dependency>
```
Manually compile a package
- Download the Spark connector code.

- Use the following command to compile the Spark connector:

  NOTICE

  You must replace `spark_version` with the Spark version you use.

  ```bash
  sh build.sh <spark_version>
  ```

  For example, if you want to use the Spark connector with Spark 3.2, compile the Spark connector as follows:

  ```bash
  sh build.sh 3.2
  ```

- Go to the `target/` path, in which a Spark connector .jar package like `starrocks-spark-connector-3.2_2.12-1.1.0-SNAPSHOT.jar` is generated upon compilation.

  NOTICE

  If you are using a Spark connector version that is not officially released, the name of the generated Spark connector .jar package contains `SNAPSHOT` as a suffix.
Spark connector 1.0.0
Download a compiled package
Manually compile a package
- Download the Spark connector code.

  NOTICE

  You must switch to the `spark-1.0` branch.

- Take one of the following actions to compile the Spark connector:

  - If you are using Spark 2.x, run the following command, which compiles the Spark connector to suit Spark 2.3.4 by default:

    ```bash
    sh build.sh 2
    ```

  - If you are using Spark 3.x, run the following command, which compiles the Spark connector to suit Spark 3.1.2 by default:

    ```bash
    sh build.sh 3
    ```

- Go to the `output/` path, in which the `starrocks-spark2_2.11-1.0.0.jar` file is generated upon compilation. Then, copy the file to the classpath of Spark:

  - If your Spark cluster runs in `Local` mode, place the file into the `jars/` path.
  - If your Spark cluster runs in `Yarn` mode, place the file into the pre-deployment package.
You can use the Spark connector to read data from StarRocks only after you place the file into the specified location.
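To verify the placement, you can start spark-shell and run a minimal read. A hedged sketch; the table and credentials are placeholders, and the option names here follow the 1.0.0-era read options (`starrocks.fenodes`, `user`, `password`), which differ from those in 1.1.x. Consult the Parameters section below for the authoritative list.

```scala
// Hedged sketch: verify the 1.0.0 connector is on the Spark classpath by
// reading a table. Placeholders: FE HTTP address 127.0.0.1:8030, table
// test.score_board. Option names follow the 1.0.0-era read options and
// differ from 1.1.x; consult the Parameters section for your version.
val checkDf = spark.read.format("starrocks")
  .option("starrocks.table.identifier", "test.score_board")
  .option("starrocks.fenodes", "127.0.0.1:8030")
  .option("user", "root")
  .option("password", "")
  .load()
checkDf.show(5)
```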
Parameters
This section describes the parameters you need to configure when you use the Spark connector to read data from StarRocks.