Skew Join V2
Skew Join V2 is an advanced optimization feature in StarRocks that addresses data skew issues in JOIN operations by broadcasting skew values. This feature significantly improves query performance when dealing with heavily skewed data distributions.
Overviewβ
Data skew occurs when certain values in join columns appear much more frequently than others, leading to uneven data distribution across nodes and causing performance bottlenecks. Skew Join V2 solves this problem by:
- Identifying skew values: Manually specifying values that cause data skew
- Broadcasting skew values: Broadcasting these specific values to all nodes to ensure even data distribution
- Hybrid execution: Using a combination of shuffle and broadcast joins for optimal performance
Skew Join V2 creates a hybrid execution plan that combines:
- Shuffle Join: For non-skew data, uses the standard shuffle-based join
- Broadcast Join: For skew values, broadcasts the right table data to all nodes
Enabling Skew Join V2β
By default, the Skew Join V2 optimization is disabled, and its predecessor, Skew Join V1, is enabled. If you find that the performance of Skew Join V1 is not satisfactory, you must disable Skew Join V1 before enabling Skew Join V2.
-- Disable Skew Join V1
SET enable_optimize_skew_join_v1 = false;
-- Enable Skew Join V2
SET enable_optimize_skew_join_v2 = true;
After enabling Skew Join V2, you can then update your queries based on the syntax with explicitly specified skew values.
Currently, Skew Join V2 does not support automatic plan rewrite based on statistics. Only hint-based manual SQL rewrite is supported.
Usageβ
Compared to the query rewrite method of Skew Join V1, Skew Join V2 offers a new syntax that allows you to explicitly specify the skew values with Broadcast.
Syntax:
SELECT select_list FROM
table1 JOIN [skew|table1.column(skew_value1, skew_value2, ...)] table2
ON join_condition
[WHERE where_clause]
Parameters:
[skew|table1.column(skew_value1, skew_value2, ...)]: The skew hint. It includes:
table1.column: The column from the left table that contains skew values.skew_value1, skew_value2, ...: A comma-separated list of values that cause data skew.
Do not omit the brackets in the skew hint.
Examplesβ
Basic Usageβ
-
Create test tables.
CREATE TABLE orders (
order_id INT,
customer_id INT,
order_date DATE,
amount DECIMAL(10,2)
) DUPLICATE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 8;
CREATE TABLE customers (
customer_id INT,
customer_name VARCHAR(100),
city VARCHAR(50)
) DUPLICATE KEY(customer_id)
DISTRIBUTED BY HASH(customer_id) BUCKETS 8; -
Insert sample data with skew.
INSERT INTO orders VALUES
(1, 1001, '2024-01-01', 100.00),
(2, 1001, '2024-01-02', 200.00),
(3, 1001, '2024-01-03', 150.00),
(4, 1002, '2024-01-01', 300.00),
(5, 1003, '2024-01-01', 250.00);
INSERT INTO customers VALUES
(1001, 'John Doe', 'New York'),
(1002, 'Jane Smith', 'Los Angeles'),
(1003, 'Bob Johnson', 'Chicago'); -
Query data with Skew Join V2.
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN [skew|o.customer_id(1001)] c
ON o.customer_id = c.customer_id;
Multiple Skew Valuesβ
Skew Join V2 supports multiple skew values.
SELECT o.order_id, c.customer_name, o.amount
FROM orders o
JOIN [skew|o.customer_id(1001, 1002, 1003)] c
ON o.customer_id = c.customer_id;
Different Data Typesβ
Skew Join V2 supports various data types for skew values.
-- String values
SELECT t1.id, t2.name
FROM table1 t1
JOIN [skew|t1.category('electronics', 'clothing', 'books')] t2
ON t1.category = t2.category;
-- Date values
SELECT t1.id, t2.event_name
FROM events t1
JOIN [skew|t1.event_date('2024-01-01', '2024-01-02')] t2
ON t1.event_date = t2.event_date;
-- Numeric values
SELECT t1.id, t2.region
FROM sales t1
JOIN [skew|t1.region_id(1, 2, 3)] t2
ON t1.region_id = t2.region_id;
Complex Join Conditionsβ
Skew Join V2 supports complex Join conditions.
-- Join condition with complex expressions
SELECT t1.id, t2.value
FROM table1 t1
JOIN [skew|t1.key(abs(t1.key))] t2
ON abs(t1.key) = abs(t2.key);
-- Multiple Join conditions
SELECT t1.id, t2.name
FROM table1 t1
JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category AND t1.region = t2.region;
Different Join Typesβ
Skew Join V2 supports various Join types.
-- Left Join
SELECT t1.id, t2.name
FROM table1 t1
LEFT JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category;
-- Left Semi Join
SELECT t1.id
FROM table1 t1
LEFT SEMI JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category;
-- Left Anti Join
SELECT t1.id
FROM table1 t1
LEFT ANTI JOIN [skew|t1.category('electronics')] t2
ON t1.category = t2.category;
Best Practicesβ
Step 1. Identify Skew Valuesβ
Before using Skew Join V2, identify the values that cause data skew:
-- Analyze data distribution
SELECT customer_id, COUNT(*) as cnt
FROM orders
GROUP BY customer_id
ORDER BY cnt DESC
LIMIT 10;
Step 2. Choose Appropriate Skew Valuesβ
Follow these rules to choose the appropriate skew values:
- Include only the most frequent values that cause significant skew.
- Avoid including too many values as it may increase broadcast overhead.
- Consider the trade-off between skew optimization and broadcast cost.
Limitationsβ
- Supported Join Types Currently, only INNER JOIN, LEFT JOIN, LEFT SEMI JOIN, and LEFT ANTI JOIN are supported.
- Data Types Currently, only basic data types (INT, BIGINT, STRING, DATE, DATETIME, and string types) are supported.
- Complex Expressions The support for complex expressions in Join conditions is limited.
- Skew Table Skew Join V2 can only handle scenarios where the large table in the JOIN operation is skewed, and the large table must be used as the left table.
Troubleshootingβ
Debug Informationβ
Execute EXPLAIN VERBOSE against the query to collect its execution plan:
EXPLAIN VERBOSE
SELECT ... FROM ... JOIN [skew|...] ...;
Check the following fields in the plan:
SplitCastDataSink: Data splitting.BROADCASTandSHUFFLE: Distribution type.UNION: Union results of both Join types.