External table
StarRocks supports access to other data sources by using external tables. An external table is created based on a data table that is stored in another data source; StarRocks stores only the metadata of that table. You can use external tables to directly query data in other data sources. StarRocks supports the following data sources: MySQL, Elasticsearch, Hive, StarRocks, Apache Iceberg, and Apache Hudi. For the StarRocks data source, you can currently only write data from one StarRocks cluster into another; you cannot read data through a StarRocks external table. For all other data sources, you can only read data.
From v2.5 onwards, StarRocks provides the Data Cache feature, which accelerates queries on hot data in external data sources. For more information, see Data Cache.
MySQL external table
In a star schema, data is generally divided into dimension tables and fact tables. Dimension tables hold less data but involve UPDATE operations. Currently, StarRocks does not support direct UPDATE operations (updates can be implemented by using a Unique Key table). In some scenarios, you can store dimension tables in MySQL and read them directly from StarRocks.
To query MySQL data, you must create an external table in StarRocks and map it to the table in your MySQL database. You need to specify the MySQL connection information when creating the table.
CREATE EXTERNAL TABLE mysql_external_table
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    k4 VARCHAR(2048),
    k5 DATETIME
)
ENGINE=mysql
PROPERTIES
(
    "host" = "127.0.0.1",
    "port" = "3306",
    "user" = "mysql_user",
    "password" = "mysql_passwd",
    "database" = "mysql_db_test",
    "table" = "mysql_table_test"
);
Parameters:
- host: the connection address of the MySQL database
- port: the port number of the MySQL database
- user: the username to log in to MySQL
- password: the password to log in to MySQL
- database: the name of the MySQL database
- table: the name of the table in the MySQL database
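After the external table is created, you can query the MySQL table directly from StarRocks. A minimal sketch, reusing the table and columns created above:

SELECT k1, k2, k4
FROM mysql_external_table
WHERE k1 = '2020-10-11';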
StarRocks external table
From StarRocks 1.19 onwards, StarRocks allows you to use a StarRocks external table to write data from one StarRocks cluster to another. This achieves read-write separation and provides better resource isolation. You first create a destination table in the destination StarRocks cluster. Then, in the source StarRocks cluster, you create a StarRocks external table that has the same schema as the destination table and specify the information of the destination cluster and table in the PROPERTIES field.
Data can be written from a source cluster to a destination cluster by using the INSERT INTO statement to write into the StarRocks external table. This helps realize the following goals:
- Data synchronization between StarRocks clusters.
- Read-write separation. Data is written to the source cluster, and data changes from the source cluster are synchronized to the destination cluster, which provides query services.
The following code shows how to create a destination table and an external table.
# Create a destination table in the destination StarRocks cluster.
CREATE TABLE t
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    k4 VARCHAR(2048),
    k5 DATETIME
)
ENGINE=olap
DISTRIBUTED BY HASH(k1) BUCKETS 10;
# Create an external table in the source StarRocks cluster.
CREATE EXTERNAL TABLE external_t
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    k4 VARCHAR(2048),
    k5 DATETIME
)
ENGINE=olap
DISTRIBUTED BY HASH(k1) BUCKETS 10
PROPERTIES
(
    "host" = "127.0.0.1",
    "port" = "9020",
    "user" = "user",
    "password" = "passwd",
    "database" = "db_test",
    "table" = "t"
);
# Write data from a source cluster to a destination cluster by writing data into the StarRocks external table. The second statement is recommended for the production environment.
INSERT INTO external_t VALUES ('2020-10-11', 1, 1, 'hello', '2020-10-11 10:00:00');
INSERT INTO external_t SELECT * FROM other_table;
Parameters:
- EXTERNAL: This keyword indicates that the table to be created is an external table.
- host: This parameter specifies the IP address of the leader FE node of the destination StarRocks cluster.
- port: This parameter specifies the RPC port of the FE node of the destination StarRocks cluster.
  NOTE
  To ensure that the source cluster to which the StarRocks external table belongs can access the destination StarRocks cluster, you must configure your network and firewall to allow access to the following ports:
  - The RPC port of the FE node. See rpc_port in the FE configuration file fe/fe.conf. The default RPC port is 9020.
  - The bRPC port of the BE node. See brpc_port in the BE configuration file be/be.conf. The default bRPC port is 8060.
- user: This parameter specifies the username used to access the destination StarRocks cluster.
- password: This parameter specifies the password used to access the destination StarRocks cluster.
- database: This parameter specifies the database to which the destination table belongs.
- table: This parameter specifies the name of the destination table.
The following limits apply when you use a StarRocks external table:
- You can only run the INSERT INTO and SHOW CREATE TABLE commands on a StarRocks external table (see the SHOW CREATE TABLE example after this list). Other data writing methods are not supported. In addition, you cannot query data from a StarRocks external table or perform DDL operations on it.
- The syntax for creating an external table is the same as that for creating a normal table, but the column names and other information in the external table must be the same as those in the destination table.
- The external table synchronizes table metadata from the destination table every 10 seconds. If a DDL operation is performed on the destination table, data synchronization between the two tables may be delayed.
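For example, you can view the definition of the external table created above as follows:

SHOW CREATE TABLE external_t;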
Elasticsearch external table
StarRocks and Elasticsearch are two popular analytics systems. StarRocks excels at large-scale distributed computing, while Elasticsearch is ideal for full-text search. Combining StarRocks with Elasticsearch delivers a more complete OLAP solution.
Example of creating an Elasticsearch external table
Syntax
CREATE EXTERNAL TABLE elastic_search_external_table
(
    k1 DATE,
    k2 INT,
    k3 SMALLINT,
    k4 VARCHAR(2048),
    k5 DATETIME
)
ENGINE=ELASTICSEARCH
PROPERTIES (
    "hosts" = "http://192.168.0.1:9200,http://192.168.0.2:9200",
    "user" = "root",
    "password" = "root",
    "index" = "tindex",
    "type" = "_doc",
    "es.net.ssl" = "true"
);
The following table describes the parameters.
Parameter | Required | Default value | Description |
---|---|---|---|
hosts | Yes | None | The connection address of the Elasticsearch cluster. You can specify one or more addresses. StarRocks parses the Elasticsearch version and index shard allocation from this address. StarRocks communicates with your Elasticsearch cluster based on the address returned by the GET /_nodes/http API operation. Therefore, the value of the hosts parameter must be the same as the address returned by the GET /_nodes/http API operation. Otherwise, BEs may not be able to communicate with your Elasticsearch cluster. |
index | Yes | None | The name of the Elasticsearch index to which the table in StarRocks maps. The name can be an alias. This parameter supports wildcards (*). For example, if you set index to hello*, StarRocks retrieves all indexes whose names start with hello. |
user | No | Empty | The username that is used to log in to the Elasticsearch cluster with basic authentication enabled. Make sure you have access to /_cluster/state and /_nodes/http, and read privileges on the index. |
password | No | Empty | The password that is used to log in to the Elasticsearch cluster. |
type | No | _doc | The type of the index. Default value: _doc. If you want to query data in Elasticsearch 8 and later versions, you do not need to configure this parameter because mapping types have been removed in Elasticsearch 8 and later versions. |
es.nodes.wan.only | No | false | Specifies whether StarRocks only uses the addresses specified by hosts to access the Elasticsearch cluster and fetch data. If this parameter is set to true, StarRocks does not sniff the data nodes on which the shards of the index reside. |
es.net.ssl | No | false | Specifies whether the HTTPS protocol can be used to access your Elasticsearch cluster. Only StarRocks 2.4 and later versions support configuring this parameter. |
enable_docvalue_scan | No | true | Specifies whether to obtain the values of the target fields from Elasticsearch columnar storage (doc values). In most cases, reading data from columnar storage outperforms reading data from row storage. |
enable_keyword_sniff | No | true | Specifies whether to sniff TEXT-type fields in Elasticsearch based on KEYWORD-type fields. If this parameter is set to false, StarRocks performs matching after tokenization. |
Columnar scan for faster queries
If you set enable_docvalue_scan to true, StarRocks follows these rules when it obtains data from Elasticsearch (see the configuration sketch after the note below):
- Try and see: StarRocks automatically checks whether columnar storage is enabled for the target fields. If so, StarRocks obtains all values in the target fields from columnar storage.
- Auto-downgrading: If any one of the target fields is unavailable in columnar storage, StarRocks parses and obtains all values in the target fields from row storage (_source).
NOTE
- Columnar storage is unavailable for TEXT-type fields in Elasticsearch. Therefore, if you query fields containing TEXT-type values, StarRocks obtains the values of these fields from _source.
- If you query a large number (greater than or equal to 25) of fields, reading field values from docvalue does not show noticeable benefits compared with reading field values from _source.
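Although enable_docvalue_scan defaults to true, you can also set it explicitly in PROPERTIES when you create the external table. A minimal configuration sketch, reusing the connection settings from the example above (the table name es_docvalue_table is illustrative):

CREATE EXTERNAL TABLE es_docvalue_table
(
    k1 DATE,
    k2 INT
)
ENGINE=ELASTICSEARCH
PROPERTIES (
    "hosts" = "http://192.168.0.1:9200",
    "index" = "tindex",
    "enable_docvalue_scan" = "true"
);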
Sniff KEYWORD-type fields
If you set enable_keyword_sniff to true, Elasticsearch allows direct data ingestion without an index because it automatically creates an index after ingestion. For STRING-type fields, Elasticsearch creates a field with both TEXT and KEYWORD types. This is how the Multi-Field feature of Elasticsearch works. The mapping is as follows:
"k4": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
For example, to conduct "=" filtering on k4, StarRocks on Elasticsearch converts the filtering operation into an Elasticsearch TermQuery.
The original SQL filter is as follows:
k4 = "StarRocks On Elasticsearch"
The converted Elasticsearch query DSL is as follows:
"term" : {
"k4": "StarRocks On Elasticsearch"
}
The first field of k4 is TEXT, and it will be tokenized by the analyzer configured for k4 (or by the standard analyzer if no analyzer is configured) after data ingestion. As a result, the first field will be tokenized into three terms: starrocks, on, and elasticsearch. The details are as follows:
POST /_analyze
{
    "analyzer": "standard",
    "text": "StarRocks On Elasticsearch"
}
The tokenization results are as follows:
{
    "tokens": [
        {
            "token": "starrocks",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "on",
            "start_offset": 10,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "elasticsearch",
            "start_offset": 13,
            "end_offset": 26,
            "type": "<ALPHANUM>",
            "position": 2
        }
    ]
}
Suppose you conduct a query as follows:
"term" : {
"k4": "StarRocks On Elasticsearch"
}
There is no term in the dictionary that matches the term StarRocks On Elasticsearch, and therefore no result will be returned.
However, if you have set enable_keyword_sniff to true, StarRocks will convert k4 = "StarRocks On Elasticsearch" to k4.keyword = "StarRocks On Elasticsearch" to match the SQL semantics. The converted Elasticsearch query DSL is as follows:
"term" : {
"k4.keyword": "StarRocks On Elasticsearch"
}
k4.keyword is of the KEYWORD type. Therefore, the data is written into Elasticsearch as a complete term, allowing for successful matching.
Mapping of column data types
When you create an external table, you must specify the data types of the columns in the external table based on the data types of the corresponding fields in the Elasticsearch index. The following table shows the mapping of column data types.
Elasticsearch | StarRocks |
---|---|
BOOLEAN | BOOLEAN |
BYTE | TINYINT/SMALLINT/INT/BIGINT |
SHORT | SMALLINT/INT/BIGINT |
INTEGER | INT/BIGINT |
LONG | BIGINT |
FLOAT | FLOAT |
DOUBLE | DOUBLE |
KEYWORD | CHAR/VARCHAR |
TEXT | CHAR/VARCHAR |
DATE | DATE/DATETIME |
NESTED | CHAR/VARCHAR |
OBJECT | CHAR/VARCHAR |
ARRAY | ARRAY |
Note
- StarRocks reads data of the NESTED type by using JSON-related functions (see the example after this note).
- Elasticsearch automatically flattens multi-dimensional arrays into one-dimensional arrays, and StarRocks does the same. Support for querying ARRAY data in Elasticsearch is added in v2.5.
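Because NESTED and OBJECT fields map to CHAR/VARCHAR columns, their values arrive as JSON strings that you can parse with StarRocks JSON functions. A minimal sketch, assuming a hypothetical NESTED field user_info that maps to a VARCHAR column of the same name in the external table:

SELECT get_json_string(user_info, '$.name') AS user_name
FROM es_table;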
Predicate pushdown
StarRocks supports predicate pushdown. Filters can be pushed down to Elasticsearch for execution, which improves query performance. The following table lists the operators that support predicate pushdown.
SQL syntax | ES syntax |
---|---|
= | term query |
in | terms query |
>=, <=, >, < | range |
and | bool.filter |
or | bool.should |
not | bool.must_not |
not in | bool.must_not + terms |
esquery | ES Query DSL |
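As an illustration of the mappings above, an IN filter such as the following is pushed down to Elasticsearch as a terms query (the table and column names reuse the examples in this topic; the DSL shown is approximate):

SELECT * FROM es_table WHERE k2 IN (100, 200);

The filter is pushed down to Elasticsearch roughly as:

"terms" : {
    "k2": [100, 200]
}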
Examples
The esquery function is used to push down queries that cannot be expressed in SQL (such as match and geoshape) to Elasticsearch for filtering. The first parameter in the esquery function is used to associate an index column. The second parameter is a JSON expression of basic Query DSL, which is enclosed in braces ({}). The JSON expression must have one and only one root key, such as match, geo_shape, or bool.
- match query
select * from es_table where esquery(k4, '{
    "match": {
        "k4": "StarRocks on elasticsearch"
    }
}');
- geo-related query
select * from es_table where esquery(k4, '{
    "geo_shape": {
        "location": {
            "shape": {
                "type": "envelope",
                "coordinates": [
                    [13, 53],
                    [14, 52]
                ]
            },
            "relation": "within"
        }
    }
}');
- bool query
select * from es_table where esquery(k4, '{
    "bool": {
        "must": [
            {
                "terms": {
                    "k1": [11, 12]
                }
            },
            {
                "terms": {
                    "k2": [100]
                }
            }
        ]
    }
}');
Usage notes
- Elasticsearch versions earlier than 5.x scan data in a different way than later versions. Currently, only versions later than 5.x are supported.
- Elasticsearch clusters with HTTP basic authentication enabled are supported.
- Some queries, such as count-related queries, may be slower through StarRocks than when run directly against Elasticsearch. This is because Elasticsearch can answer a count query directly from the metadata of the target documents without filtering the real data, which makes the count much faster (see the example after this list).
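For example, a count query like the following may run faster directly on Elasticsearch than through StarRocks (es_table follows the examples above):

SELECT COUNT(*) FROM es_table;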
External table for a JDBC-compatible database
From v2.3.0, StarRocks provides external tables to query JDBC-compatible databases. This way, you can analyze the data of such databases in a blazing fast manner without the need to import the data into StarRocks. This section describes how to create an external table in StarRocks and query data in JDBC-compatible databases.
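As a preview, here is a minimal sketch of the two steps involved: first create a JDBC resource that stores the connection information, then create a JDBC external table that references it. All names (jdbc_resource, jdbc_external_table, pg_db, pg_table) and the PostgreSQL driver are illustrative assumptions; verify the exact syntax against your StarRocks version.

# Create a JDBC resource that stores the connection information.
CREATE EXTERNAL RESOURCE jdbc_resource
PROPERTIES (
    "type" = "jdbc",
    "user" = "pg_user",
    "password" = "pg_passwd",
    "jdbc_uri" = "jdbc:postgresql://127.0.0.1:5432/pg_db",
    "driver_url" = "https://repo1.maven.org/maven2/org/postgresql/postgresql/42.3.3/postgresql-42.3.3.jar",
    "driver_class" = "org.postgresql.Driver"
);

# Create a JDBC external table that references the resource.
CREATE EXTERNAL TABLE jdbc_external_table
(
    id INT,
    name VARCHAR(128)
)
ENGINE=jdbc
PROPERTIES (
    "resource" = "jdbc_resource",
    "table" = "pg_table"
);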