Partitioning
Fast analytics in StarRocks begin with a table layout that matches your query patterns. This guide distills hands‑on experience into clear rules for partitioning, helping you:
- Scan less data via aggressive partition pruning
- Manage lifecycle tasks (TTL, GDPR deletes, tiering) with metadata-only operations
- Scale smoothly as tenant counts, data volume, or retention windows grow
- Control write amplification: new data lands in the “hot” partition while compaction runs in historical partitions
Keep this advice close when modeling a new table or refactoring an old one—each section gives purpose‑driven criteria, design heuristics, and operational guard‑rails so you avoid costly re‑partitioning down the road.
Partitioning vs. Bucketing: different jobs
Understanding the distinction between partitioning and bucketing is fundamental when designing performant StarRocks tables. While both help manage large datasets, they serve different purposes:
- Partitioning allows StarRocks to skip entire blocks of data at query time using partition pruning, and enables metadata-only lifecycle operations like dropping old or tenant-specific data.
- Bucketing, on the other hand, helps distribute data across tablets to parallelize query execution and balance load, especially when combined with hash functions.
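To make the two roles concrete, here is a minimal sketch that declares both clauses on one table (the table name and columns are illustrative assumptions): PARTITION BY drives pruning and lifecycle operations, while DISTRIBUTED BY spreads each partition's rows across tablets.

```sql
-- Daily partitions for pruning and TTL; hash buckets for parallelism.
CREATE TABLE orders (
    order_id   BIGINT,
    order_time DATETIME,
    amount     DECIMAL(10, 2)
)
DUPLICATE KEY(order_id, order_time)
PARTITION BY date_trunc('day', order_time)   -- unit of pruning and DROP PARTITION
DISTRIBUTED BY HASH(order_id) BUCKETS AUTO;  -- tablets within each partition
```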
Aspect | Partitioning | Bucketing (Hash/Random) |
---|---|---|
Primary goal | Coarse‑grain data pruning and lifecycle control (TTL, archiving). | Fine‑grain parallelism and data locality inside each partition. |
Planner visibility | Partitions are catalog objects; the FE can skip them via predicates. | Only equality predicates support bucket pruning. |
Lifecycle ops | DROP PARTITION is metadata‑only, ideal for GDPR deletes and monthly roll‑off. | Buckets can’t be dropped; they change only with ALTER TABLE … MODIFY DISTRIBUTED BY. |
Typical count | 10^2–10^4 per table (days, weeks, tenants). | 10–120 per partition; StarRocks BUCKETS AUTO tunes this. |
Skew handling | Merge or split partitions; consider a composite/hybrid scheme. | Raise the bucket count, hash on a compound key, isolate “whales”, or use random bucketing. |
Red flags | >100 k partitions add a significant memory footprint on the FE. | >200 k tablets per BE; tablets exceeding 10 GB may encounter compaction issues. |
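As a small illustration of the lifecycle row above, rolling off a day of data is a single metadata-only statement. The table and partition names below are hypothetical; with expression partitioning the actual partition names would be looked up first.

```sql
-- Hypothetical table "events", partitioned by date_trunc('day', event_time).
-- Dropping a partition is metadata-only: no rows are rewritten.
ALTER TABLE events DROP PARTITION p20240101;
```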
When should I partition?
Table type | Partition? | Typical key |
---|---|---|
Fact / event stream | Yes | date_trunc('day', event_time) |
Huge dimension (billions of rows) | Sometimes | Time or business key change date |
Small dimension / lookup | No | Rely on hash distribution |
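For the last row, a sketch of what “no partitioning” looks like in practice, using an assumed country lookup table: the DDL simply omits PARTITION BY and relies on hash distribution.

```sql
-- Small lookup table: no PARTITION BY clause, hash distribution only.
CREATE TABLE dim_country (
    country_id   INT,
    country_code STRING,
    country_name STRING
)
DUPLICATE KEY(country_id)
DISTRIBUTED BY HASH(country_id) BUCKETS AUTO;
```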
Choosing the partition key
- Time‑first default – If 80 % of queries include a time filter, lead with date_trunc('day', dt).
- Tenant isolation – Add tenant_id to the key when you need to manage data on a per‑tenant basis.
- Retention alignment – Put the column you plan to purge by into the key.
- Composite keys – PARTITION BY (tenant_id, date_trunc('day', dt)) prunes perfectly but creates #tenants × #days partitions. Keep the total below ≈ 100 k, or FE memory and BE compaction suffer (see the worked example below).
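To make the composite-key math concrete, assume a hypothetical deployment with 2,000 tenants on daily partitions: 2,000 × 365 = 730,000 partitions per year, far beyond the ≈ 100 k guideline. Coarsening the time component to months gives 2,000 × 12 = 24,000 partitions per year, which stays comfortably inside it.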
Picking granularity
The granularity of PARTITION BY date_trunc('day', dt) should be adjusted to the use case: "hour", "day", "month", and so on. See date_trunc for the supported units.
Granularity | Use when | Pros | Cons |
---|---|---|---|
Daily (default) | Most BI & reporting | Few partitions (365/yr); simple TTL | Less precise for "last 3 h" queries |
Hourly | > 2 × tablet per day; IoT bursts | Hot‑spot isolation; 24 partitions/day | ≈ 8,760 partitions/yr |
Weekly / Monthly | Historical archive | Tiny metadata; merges easy | Coarser pruning |
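Changing granularity is only a matter of the unit passed to date_trunc. A minimal sketch for the hourly row, using an assumed IoT table:

```sql
-- Hourly partitions (~24 per day) isolate ingest hot spots.
CREATE TABLE sensor_readings (
    sensor_id    BIGINT,
    reading_time DATETIME,
    value        DOUBLE
)
DUPLICATE KEY(sensor_id, reading_time)
PARTITION BY date_trunc('hour', reading_time)
DISTRIBUTED BY HASH(sensor_id) BUCKETS AUTO;
```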
- Rule of thumb: Keep each partition ≤ 100 GB and ≤ 20 k tablets/partition across replicas.
- Mixed granularity: Starting from version 3.4, StarRocks supports mixed partition granularity by merging historical partitions into coarser‑grained ones.
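To check a table against the rule of thumb above, SHOW PARTITIONS reports per-partition row counts and data sizes (the table name below is a placeholder):

```sql
-- Spot partitions drifting past the ~100 GB guideline.
SHOW PARTITIONS FROM sensor_readings;
```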
Example recipes
Click‑stream fact (single‑tenant)
CREATE TABLE click_stream (
user_id BIGINT,
event_time DATETIME,
url STRING,
...
)
DUPLICATE KEY(user_id, event_time)
PARTITION BY (date_trunc('day', event_time))
DISTRIBUTED BY HASH(user_id) BUCKETS AUTO;
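With this layout, a time-bounded query should prune to just the partitions in the requested window; a sketch with illustrative dates:

```sql
-- Reads only the 7 daily partitions that cover the range.
SELECT date_trunc('day', event_time) AS event_day,
       count(*) AS clicks
FROM click_stream
WHERE event_time >= '2024-01-01' AND event_time < '2024-01-08'
GROUP BY date_trunc('day', event_time)
ORDER BY event_day;
```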
SaaS metrics (multi‑tenant, pattern A)
Recommended for most SaaS workloads. Prunes on time, keeps tenant rows co‑located.
CREATE TABLE metrics (
tenant_id INT,
dt DATETIME,
metric_name STRING,
v DOUBLE
)
PRIMARY KEY(tenant_id, dt, metric_name)
PARTITION BY date_trunc('DAY', dt)
DISTRIBUTED BY HASH(tenant_id) BUCKETS AUTO;
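A typical per-tenant query under this pattern filters on both columns: the dt predicate prunes daily partitions, and the tenant_id predicate is served by the tablets its hash maps to (values below are illustrative).

```sql
-- Time filter prunes partitions; tenant filter stays within a few tablets.
SELECT metric_name, avg(v) AS avg_value
FROM metrics
WHERE tenant_id = 42
  AND dt >= '2024-03-01' AND dt < '2024-03-08'
GROUP BY metric_name;
```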
Whale tenant composite (pattern B)
Use this pattern when tenant-specific DML/DDL is necessary or when very large (“whale”) tenants are present, but be cautious of potential partition explosion.
CREATE TABLE activity (
tenant_id INT,
dt DATETIME,
id BIGINT,
...
)
DUPLICATE KEY(dt, id)
PARTITION BY (tenant_id, date_trunc('MONTH', dt))
DISTRIBUTED BY HASH(id) BUCKETS AUTO;
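Under this composite scheme both key parts can prune: a query that names a tenant and a month (values illustrative) should touch a single partition, and a tenant's history can later be rolled off partition by partition.

```sql
-- Both predicates prune: only the (tenant_id = 7, 2024-03) partition is read.
SELECT count(*) AS events
FROM activity
WHERE tenant_id = 7
  AND dt >= '2024-03-01' AND dt < '2024-04-01';
```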