Asynchronous materialized views
This topic describes how to understand, create, use, and manage an asynchronous materialized view. Asynchronous materialized views are supported from StarRocks v2.4 onwards.
Compared with synchronous materialized views, asynchronous materialized views support multi-table join and more aggregate functions. The refresh of asynchronous materialized views can be triggered manually, regualrly at pre-specified intervals, or automatically at data changes in the base table. You can also refresh some of the partitions instead of the whole materialized view, greatly reducing the cost of refresh. In addition, asynchronous materialized views support a variety of query rewrite scenarios, allowing automatic, transparent query acceleration.
For the scenario and usage of the synchronous materialized views (Rollup), see Synchronous materialized view (Rollup).
Overview
Applications in databases often perform complex queries on large tables. Such queries involve multi-table joins and aggregations on tables that contain billions of rows. Processing these queries can be expensive, in terms of system resources and the time it takes to compute the results.
Asynchronous materialized views in StarRocks are designed to tackle these issues. An asynchronous materialized view is a special physical table that holds pre-computed query results from one or more base tables. When you perform complex queries on the base table, StarRocks returns the pre-computed results from the relevant materialized views to process these queries. This way, query performance can be improved because repetitive complex calculations are avoided. This performance difference can be significant when a query is run frequently or is sufficiently complex.
Additionally, asynchronous materialized views are especially useful for building mathematical models upon your data warehouse. By doing so, you can provide a unified data specification to upper-layer applications, shield the underlying implementation, or protect the raw data security of the base tables.
Understand materialized views in StarRocks
StarRocks v2.3 and earlier versions provided a synchronous materialized view that can be built only on a single table. Synchronous materialized views, or the Rollup, retain higher data freshness and lower refreshing costs. However, compared to asynchronous materialized views supported from v2.4 onwards, synchronous materialized views have many limitations. You have limited choices of aggregation operators when you want to build a synchronous materialized view to accelerate or rewrite your queries.
The following table compares the asynchronous materialized views (ASYNC MV) and the synchronous materialized view (SYNC MV) in StarRocks in the perspective of features that they support:
Single-table aggregation | Multi-table join | Query rewrite | Refresh strategy | Base table | |
---|---|---|---|---|---|
ASYNC MV | Yes | Yes | Yes |
| Multiple tables from:
|
SYNC MV (Rollup) | Limited choices of aggregate functions | No | Yes | Synchronous refresh during data loading | Single table in the default catalog |
Basic concepts
-
Base table
Base tables are the driving tables of a materialized view.
For StarRocks' asynchronous materialized views, base tables can be StarRocks native tables in the default catalog, tables in external catalogs (supported from v2.5), or even existing asynchronous materialized views (supported from v2.5) and views (supported from v3.1). StarRocks supports creating asynchronous materialized views on all types of StarRocks tables.
-
Refresh
When you create an asynchronous materialized view, its data reflects only the state of the base tables at that time. When the data in the base tables change, you need to refresh the materialized view to keep the changes synchronized.
Currently, StarRocks supports two generic refreshing strategies:
- ASYNC: Asynchronous refresh mode. Materialized views can be refreshed automatically when the data in the base table changes, or at regular intervals based on specified intervals.
- MANUAL: Manual refresh mode. The materialized view will not be automatically refreshed. The refresh tasks can only be triggered manually by users.
-
Query rewrite
Query rewrite means that when executing a query on base tables with materialized views built on, the system automatically judges whether the pre-computed results in the materialized view can be reused for the query. If they can be reused, the system will load the data directly from the relevant materialized view to avoid the time- and resource-consuming computations or joins.
From v2.5, StarRocks supports automatic, transparent query rewrite based on the SPJG-type asynchronous materialized views. The SPJG-type materialized views refer to materialized views whose plan only includes Scan, Filter, Project, and Aggregate types of operators.
NOTE
Asynchronous materialized views created on base tables in a JDBC catalog or a Hudi catalog do not support query rewrite.
Decide when to create a materialized view
You can create an asynchronous materialized view if you have the following demands in your data warehouse environment:
-
Accelerating queries with repetitive aggregate functions
Suppose that most queries in your data warehouse include the same sub-query with an aggregate function, and these queries have consumed a huge proportion of your computing resources. Based on this sub-query, you can create an asynchronous materialized view, which will compute and store all results of the sub-query. After the materialized view is built, StarRocks rewrites all queries that contain the sub-query, loads the intermediate results stored in the materialized view, and thus accelerates these queries.
-
Regular JOIN of multiple tables
Suppose that you need to regularly join multiple tables in your data warehouse to make a new wide table. You can build an asynchronous materialized view for these tables, and set the ASYNC refreshing strategy that triggers refreshing tasks at a fixed time interval. After the materialized view is built, query results are returned directly from the materialized view, and thus the latency caused by JOIN operations is avoided.
-
Data warehouse layering
Suppose that your data warehouse contains a mass of raw data, and queries in it require a complex set of ETL operations. You can build multiple layers of asynchronous materialized views to stratify the data in your data warehouse, and thus decompose the query into a series of simple sub-queries. It can significantly reduce repetitive computation, and, more importantly, help your DBA identify the problem with ease and efficiency. Beyond that, data warehouse layering helps decouple raw data and statistical data, protecting the security of sensitive raw data.
-
Accelerating queries in data lakes
Querying a data lake can be slow due to network latency and object storage throughput. You can enhance the query performance by building an asynchronous materialized view on top of the data lake. Moreover, StarRocks can intelligently rewrite queries to use the existing materialized views, saving you the trouble of modifying your queries manually.
For specific use cases of asynchronous materialized views, refer to the following content:
Create an asynchronous materialized view
StarRocks' asynchronous materialized views can be created on the following base tables:
- StarRocks' native tables (all StarRocks table types are supported)
- Tables in external catalogs
- Existing asynchronous materialized views (Since v2.5)
- Existing views (Since v3.1)
Before you begin
The following examples involve two base tables in the default catalog:
- The table
goods
records the item IDitem_id1
, the item nameitem_name
, and the item priceprice
. - The table
order_list
records the order IDorder_id
, client IDclient_id
, item IDitem_id2
, and order dateorder_date
.
The column goods.item_id1
is equivalent to the column order_list.item_id2
.
Execute the following statements to create the tables and insert data into them:
CREATE TABLE goods(
item_id1 INT,
item_name STRING,
price FLOAT
) DISTRIBUTED BY HASH(item_id1);
INSERT INTO goods
VALUES
(1001,"apple",6.5),
(1002,"pear",8.0),
(1003,"potato",2.2);
CREATE TABLE order_list(
order_id INT,
client_id INT,
item_id2 INT,
order_date DATE
) DISTRIBUTED BY HASH(order_id);
INSERT INTO order_list
VALUES
(10001,101,1001,"2022-03-13"),
(10001,101,1002,"2022-03-13"),
(10002,103,1002,"2022-03-13"),
(10002,103,1003,"2022-03-14"),
(10003,102,1003,"2022-03-14"),
(10003,102,1001,"2022-03-14");
The scenario in the following example demands frequent calculations of the total of each order. It requires frequent joins of the two base tables and intensive usage of the aggregate function sum()
. Besides, the business scenario demands the data refresh at an interval of one day.
The query statement is as follows:
SELECT
order_id,
sum(goods.price) as total
FROM order_list INNER JOIN goods ON goods.item_id1 = order_list.item_id2
GROUP BY order_id;