加速保持基数的 Join
本主题介绍如何通过表裁剪加速保持基数的连接。此功能从 v3.1 版本开始支持。
概述
保持基数的连接确保输出行的基数和重复因子与连接中一个输入表的基数和重复因子保持一致。请考虑以下示例
-
Inner Join(内连接)
SELECT A.* FROM A INNER JOIN B ON A.fk = B.pk;
在这种情况下,
A.fk
(外键)不为 NULL 并引用B.pk
(主键)。A
中的每一行恰好匹配B
中的一行,因此输出的基数和重复因子与A
的基数和重复因子匹配。 -
Left Join(左连接)
SELECT A.* FROM A LEFT JOIN B ON A.fk = B.pk;
此处,
A.fk
引用B.pk
,但A.fk
可以包含 NULL 值。A
中的每一行最多匹配B
中的一行。因此,输出的基数和重复因子与A
保持一致。
在这些类型的连接中,如果最终输出列仅依赖于表 A
中的列,并且不使用表 B
中的列,则可以从连接中裁剪表 B
。从 v3.1 开始,StarRocks 支持保持基数连接中的 表裁剪,这可能发生在公共表表达式 (CTE)、逻辑视图和子查询中。
用例:风险控制中的实时特征选择
用于保持基数连接的表裁剪功能在诸如风险控制的 实时特征选择 等场景中特别有用。在这种情况下,用户需要从大量表中选择数据,通常处理列和表的组合爆炸。以下特征在风险控制领域很常见
- 众多特征分布在许多独立更新的表中。
- 必须实时查看和查询最新数据。
- 使用 扁平逻辑视图 来简化数据模型,使列提取的 SQL 更简洁且高效。
与其他加速数据层相比,使用扁平逻辑视图有助于用户有效地访问实时数据。在每个列提取查询中,只需要连接几个表(而不是逻辑视图中的所有表)。通过从这些查询中裁剪未使用的表,您可以减少连接的数量并提高性能。
功能支持
表裁剪功能支持 星型模式 和 雪花模式 中的多表连接。多表连接可以出现在 CTE、逻辑视图和子查询中,从而实现更有效的查询执行。
当前,表裁剪功能仅在 OLAP 表和云原生表上支持。多连接中的外部表无法裁剪。
用法
以下示例使用 TPC-H 数据集。
前提条件
要使用表裁剪功能,必须满足以下条件
- 启用表裁剪
- 设置键约束
启用表裁剪
默认情况下,表裁剪是禁用的。您需要通过配置以下会话变量来启用该功能
-- Enable RBO-phase table pruning.
SET enable_rbo_table_prune=true;
-- Enable CBO-phase table pruning.
SET enable_cbo_table_prune=true;
-- Enable RBO-phase table pruning for the UPDATE statement on the Primary Key tables.
SET enable_table_prune_on_update = true;
设置键约束
要裁剪的表必须至少在 LEFT 或 RIGHT 连接中具有唯一键或主键约束。要在 INNER JOIN 中裁剪表,除了唯一键或主键约束之外,您还必须定义外键约束。
主键表和唯一键表具有自然构建到其中的隐式主键或唯一键约束。但是,对于 Duplicate Key 表,您必须手动定义唯一键约束,并确保不存在重复行。请注意,StarRocks 不会对 Duplicate Key 表强制执行唯一键约束。相反,它将它们视为更激进查询计划的优化提示。
示例
-- Define the Unique Key constraint during table creation.
CREATE TABLE `lineitem` (
`l_orderkey` int(11) NOT NULL COMMENT "",
`l_partkey` int(11) NOT NULL COMMENT "",
`l_suppkey` int(11) NOT NULL COMMENT "",
`l_linenumber` int(11) NOT NULL COMMENT "",
`l_quantity` decimal64(15, 2) NOT NULL COMMENT "",
`l_extendedprice` decimal64(15, 2) NOT NULL COMMENT "",
`l_discount` decimal64(15, 2) NOT NULL COMMENT "",
`l_tax` decimal64(15, 2) NOT NULL COMMENT "",
`l_returnflag` varchar(1) NOT NULL COMMENT "",
`l_linestatus` varchar(1) NOT NULL COMMENT "",
`l_shipdate` date NOT NULL COMMENT "",
`l_commitdate` date NOT NULL COMMENT "",
`l_receiptdate` date NOT NULL COMMENT "",
`l_shipinstruct` varchar(25) NOT NULL COMMENT "",
`l_shipmode` varchar(10) NOT NULL COMMENT "",
`l_comment` varchar(44) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`l_orderkey`,`l_partkey`, `l_suppkey`)
DISTRIBUTED BY HASH(`l_orderkey`) BUCKETS 96
PROPERTIES (
"unique_constraints" = "l_orderkey,l_linenumber"
);
-- Or you can define the Unique Key constraint after table creation.
ALTER TABLE lineitem SET ("unique_constraints" = "l_orderkey,l_linenumber");
另一方面,外键约束必须显式定义。与 Duplicate Key 表上的唯一键约束类似,外键约束充当优化器的提示。StarRocks 不强制执行外键约束一致性。在将数据生成和摄取到 StarRocks 中时,您必须确数据完整性。
示例
-- Create the table to be referenced in the Foreign Key constraint.
-- Note the column to be referenced must have Unique Key or Primary Key constraints.
-- In this example, `p_partkey` is the Primary Key of the table `part`.
CREATE TABLE part (
p_partkey int(11) NOT NULL,
p_name VARCHAR(55) NOT NULL,
p_mfgr CHAR(25) NOT NULL,
p_brand CHAR(10) NOT NULL,
p_type VARCHAR(25) NOT NULL,
p_size INT NOT NULL,
p_container CHAR(10) NOT NULL,
p_retailprice DOUBLE NOT NULL,
p_comment VARCHAR(23) NOT NULL
) ENGINE=OLAP
PRIMARY KEY(`p_partkey`)
DISTRIBUTED BY HASH(`p_partkey`) BUCKETS 12;
-- Define the Foreign Key constraint during table creation.
CREATE TABLE `lineitem` (
`l_orderkey` int(11) NOT NULL COMMENT "",
`l_partkey` int(11) NOT NULL COMMENT "",
`l_suppkey` int(11) NOT NULL COMMENT "",
`l_linenumber` int(11) NOT NULL COMMENT "",
`l_quantity` decimal64(15, 2) NOT NULL COMMENT "",
`l_extendedprice` decimal64(15, 2) NOT NULL COMMENT "",
`l_discount` decimal64(15, 2) NOT NULL COMMENT "",
`l_tax` decimal64(15, 2) NOT NULL COMMENT "",
`l_returnflag` varchar(1) NOT NULL COMMENT "",
`l_linestatus` varchar(1) NOT NULL COMMENT "",
`l_shipdate` date NOT NULL COMMENT "",
`l_commitdate` date NOT NULL COMMENT "",
`l_receiptdate` date NOT NULL COMMENT "",
`l_shipinstruct` varchar(25) NOT NULL COMMENT "",
`l_shipmode` varchar(10) NOT NULL COMMENT "",
`l_comment` varchar(44) NOT NULL COMMENT ""
) ENGINE=OLAP
DUPLICATEK KEY(`l_orderkey`,`l_partkey`, `l_suppkey`)
DISTRIBUTED BY HASH(`l_orderkey`) BUCKETS 96
PROPERTIES (
"foreign_key_constraints" = "(l_partkey) REFERENCES part(p_partkey)"
);
-- Or you can define the Foreign Key constraint after table creation.
ALTER TABLE lineitem SET ("foreign_key_constraints" = "(l_partkey) REFERENCES part(p_partkey)");
基于唯一键或主键在 LEFT/RIGHT JOIN 中进行表裁剪
LEFT 或 RIGHT JOIN 中的表裁剪不需要连接的保留侧具有引用裁剪侧的外键。即使无法保证引用完整性,这也使得裁剪更加灵活和强大。
与基于外键的 INNER JOIN 裁剪相比,基于唯一键或主键的 LEFT/RIGHT JOIN 中的裁剪具有 较少的严格要求 。
裁剪的条件是
-
裁剪侧
裁剪的表必须是 LEFT JOIN 中的 右侧 或 RIGHT JOIN 中的 左侧。
-
连接条件
连接必须仅使用相等条件 (
=
),并且裁剪侧的连接列必须是唯一键或主键的 超集。 -
输出列
应仅输出保留侧的列,并且结果应保持与保留侧 相同的基数和重复因子 。
-
NULL/默认值
保留侧的连接列 可以 包含与裁剪侧不匹配的 NULL 或其他默认值。
示例
-
创建表并插入数据。
-- The table `depts` has its Primary Key constraint on the column `deptno`.
CREATE TABLE `depts` (
`deptno` int(11) NOT NULL COMMENT "",
`name` varchar(25) NOT NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`deptno`)
DISTRIBUTED BY HASH(`deptno`) BUCKETS 10;
CREATE TABLE `emps` (
`empid` int(11) NOT NULL COMMENT "",
`deptno` int(11) NOT NULL COMMENT "",
`name` varchar(25) NOT NULL COMMENT "",
`salary` double NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`empid`)
DISTRIBUTED BY HASH(`empid`) BUCKETS 10;
INSERT INTO depts VALUES
(1, "R&D"),
(2, "Marketing"),
(3, "Community"),
(4, "DBA"),(5, "POC");
INSERT INTO emps VALUES
(1, 1, "Alice", "6000"),
(2, 1, "Bob", "6100"),
(3, 2, "Candy", "10000"),
(4, 2, "Dave", "20000"),
(5, 3, "Evan","18000"),
(6, 3, "Freman","1000"),
(7, 4, "George","1800"),
(8, 4, "Harry","2000"),
(9, 5, "Ivan", "15000"),
(10, 5, "Jim","20000"),
(11, -1, "Kevin","1500"),
(12, -1, "Lily","2500"); -
查看查询的逻辑执行计划。
-- Q1: Query all columns of `emps` columns. Not all columns of `depts` are queried.
EXPLAIN LOGICAL SELECT emps.* FROM emps LEFT JOIN depts ON emps.deptno = depts.deptno;
+-----------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------+
| - Output => [1:empid, 2:deptno, 3:name, 4:salary] |
| - SCAN [emps] => [1:empid, 2:deptno, 3:name, 4:salary] |
| Estimates: {row: 10, cpu: ?, memory: ?, network: ?, cost: 20.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
+-----------------------------------------------------------------------------+
-- Q2: Query only `deptno` and `salary` in `emps`. No column in `depts` is queried.
EXPLAIN LOGICAL SELECT emps.deptno, avg(salary) AS mean_salary
FROM emps LEFT JOIN depts ON emps.deptno = depts.deptno
GROUP BY emps.deptno
ORDER BY mean_salary DESC
LIMIT 5;
+-------------------------------------------------------------------------------------------------+
| Explain String |
+-------------------------------------------------------------------------------------------------+
| - Output => [2:deptno, 7:avg] |
| - TOP-5(FINAL)[7: avg DESC NULLS LAST] |
| Estimates: {row: 3, cpu: ?, memory: ?, network: ?, cost: 138.0} |
| - TOP-5(PARTIAL)[7: avg DESC NULLS LAST] |
| Estimates: {row: 3, cpu: ?, memory: ?, network: ?, cost: 114.0} |
| - AGGREGATE(GLOBAL) [2:deptno] |
| Estimates: {row: 3, cpu: ?, memory: ?, network: ?, cost: 90.0} |
| 7:avg := avg(7:avg) |
| - EXCHANGE(SHUFFLE) [2] |
| Estimates: {row: 6, cpu: ?, memory: ?, network: ?, cost: 72.0} |
| - AGGREGATE(LOCAL) [2:deptno] |
| Estimates: {row: 6, cpu: ?, memory: ?, network: ?, cost: 48.0} |
| 7:avg := avg(4:salary) |
| - SCAN [emps] => [2:deptno, 4:salary] |
| Estimates: {row: 12, cpu: ?, memory: ?, network: ?, cost: 12.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
+-------------------------------------------------------------------------------------------------+
-- Q3: Only columns in `emps` are queried. Although predicate `name = "R&D"` only
-- selects certain rows of `depts`, the final results only depends on `emps`.
EXPLAIN LOGICAL SELECT emps.deptno, avg(salary) AS mean_salary
FROM emps LEFT JOIN
(SELECT deptno FROM depts WHERE name="R&D") t ON emps.deptno = t.deptno
GROUP BY emps.deptno
ORDER BY mean_salary DESC
LIMIT 5;
+-------------------------------------------------------------------------------------------------+
| Explain String |
+-------------------------------------------------------------------------------------------------+
| - Output => [2:deptno, 7:avg] |
| - TOP-5(FINAL)[7: avg DESC NULLS LAST] |
| Estimates: {row: 3, cpu: ?, memory: ?, network: ?, cost: 138.0} |
| - TOP-5(PARTIAL)[7: avg DESC NULLS LAST] |
| Estimates: {row: 3, cpu: ?, memory: ?, network: ?, cost: 114.0} |
| - AGGREGATE(GLOBAL) [2:deptno] |
| Estimates: {row: 3, cpu: ?, memory: ?, network: ?, cost: 90.0} |
| 7:avg := avg(7:avg) |
| - EXCHANGE(SHUFFLE) [2] |
| Estimates: {row: 6, cpu: ?, memory: ?, network: ?, cost: 72.0} |
| - AGGREGATE(LOCAL) [2:deptno] |
| Estimates: {row: 6, cpu: ?, memory: ?, network: ?, cost: 48.0} |
| 7:avg := avg(4:salary) |
| - SCAN [emps] => [2:deptno, 4:salary] |
| Estimates: {row: 12, cpu: ?, memory: ?, network: ?, cost: 12.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
+-------------------------------------------------------------------------------------------------+
-- Q4: The predicate `depts.name="R&D"` in the WHERE clause breaches the
-- cardinality preserving conditions, so `depts` can not be pruned.
EXPLAIN LOGICAL SELECT emps.deptno, avg(salary) AS mean_salary
FROM emps LEFT JOIN depts ON emps.deptno = depts.deptno
WHERE depts.name="R&D";
+-----------------------------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------------------------+
| - Output => [8:any_value, 7:avg] |
| - AGGREGATE(GLOBAL) [] |
| Estimates: {row: 1, cpu: ?, memory: ?, network: ?, cost: 110.5} |
| 7:avg := avg(7:avg) |
| 8:any_value := any_value(8:any_value) |
| - EXCHANGE(GATHER) |
| Estimates: {row: 1, cpu: ?, memory: ?, network: ?, cost: 105.5} |
| - AGGREGATE(LOCAL) [] |
| Estimates: {row: 1, cpu: ?, memory: ?, network: ?, cost: 101.5} |
| 7:avg := avg(4:salary) |
| 8:any_value := any_value(2:deptno) |
| - HASH/INNER JOIN [2:deptno = 5:deptno] => [2:deptno, 4:salary] |
| Estimates: {row: 12, cpu: ?, memory: ?, network: ?, cost: 79.5} |
| - SCAN [emps] => [2:deptno, 4:salary] |
| Estimates: {row: 12, cpu: ?, memory: ?, network: ?, cost: 12.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
| - EXCHANGE(BROADCAST) |
| Estimates: {row: 5, cpu: ?, memory: ?, network: ?, cost: 25.0} |
| - SCAN [depts] => [5:deptno] |
| Estimates: {row: 5, cpu: ?, memory: ?, network: ?, cost: 5.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
| predicate: 6:name = 'R&D' |
+-----------------------------------------------------------------------------------------------+
在上面的示例中,如执行计划中所示,Q1、Q2 和 Q3 允许表裁剪,而违反基数保持条件的 Q4 无法执行表裁剪。
基于外键在 INNER/LEFT/RIGHT JOIN 中进行表裁剪
INNER JOIN 中的表裁剪更具限制性,因为保留侧必须具有引用裁剪侧的 外键,并且必须确保引用完整性。目前,StarRocks 不强制执行外键约束的一致性检查。
虽然基于外键的表裁剪更严格,但它也更强大。它允许优化器在 INNER JOIN 中利用列等价推理,从而使裁剪在更复杂的场景中成为可能。
裁剪的条件如下
-
裁剪侧
- 在 LEFT JOIN 中,裁剪的表必须在右侧。在 RIGHT JOIN 中,它必须在左侧。
- 在 INNER JOIN 中,裁剪的表必须在连接列上具有唯一键约束,并且保留侧的每一行都必须精确匹配裁剪侧的一行。如果保留侧存在不匹配的行,则无法进行裁剪。
-
连接条件
连接必须仅使用相等条件 (
=
)。保留侧的连接列必须是 外键,而裁剪侧的连接列必须是 主键 或 唯一键。这些列必须根据外键约束对齐。 -
输出列
- 应仅输出保留侧的列,并且结果应保持与保留侧 相同的基数和重复因子 。
- 对于 INNER JOIN,裁剪侧的连接列可以被保留侧的等效列替换。如果裁剪侧的所有输出列都是连接列,则也可以进行裁剪。
-
NULL/默认值
保留侧的连接列 不能 包含与裁剪侧不匹配的 NULL 或其他默认值。
示例
-
创建表,定义外键,并插入数据。
CREATE TABLE `depts` (
`deptno` int(11) NOT NULL COMMENT "",
`name` varchar(25) NOT NULL COMMENT ""
) ENGINE=OLAP
PRIMARY KEY(`deptno`)
DISTRIBUTED BY HASH(`deptno`) BUCKETS 10;
CREATE TABLE `emps` (
`empid` int(11) NOT NULL COMMENT "",
`deptno` int(11) NOT NULL COMMENT "",
`name` varchar(25) NOT NULL COMMENT "",
`salary` double NULL COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(`empid`)
DISTRIBUTED BY HASH(`empid`) BUCKETS 10;
ALTER TABLE emps SET ("foreign_key_constraints" = "(deptno) REFERENCES depts(deptno)");
INSERT INTO depts VALUES
(1, "R&D"),
(2, "Marketing"),
(3, "Community"),
(4, "DBA"),(5, "POC");
INSERT INTO emps VALUES
(1, 1, "Alice", "6000"),
(2, 1, "Bob", "6100"),
(3, 2, "Candy", "10000"),
(4, 2, "Dave", "20000"),
(5, 3, "Evan","18000"),
(6, 3, "Freman","1000"),
(7, 4, "George","1800"),
(8, 4, "Harry","2000"),
(9, 5, "Ivan", "15000"),
(10, 5, "Jim","20000"); -
查看查询的逻辑执行计划。
-- Q1: Query `empid` and `name` in `emps`, and `deptno` in `depts`.
-- However, as shown in the Join condition, `emps.deptno` is equivalent to `depts.deptno`,
-- so `emps.deptno` can substitute `depts.deptno`.
EXPLAIN LOGICAL WITH t0 AS (
SELECT empid, depts.deptno, emps.name, emps.salary, depts.name AS dept_name
FROM emps INNER JOIN depts ON emps.deptno = depts.deptno
)
SELECT empid, deptno, name FROM t0;
+-----------------------------------------------------------------------------+
| Explain String |
+-----------------------------------------------------------------------------+
| - Output => [7:empid, 8:deptno, 9:name] |
| - SCAN [emps] => [7:empid, 8:deptno, 9:name] |
| Estimates: {row: 12, cpu: ?, memory: ?, network: ?, cost: 18.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
+-----------------------------------------------------------------------------+
-- Q2: Only query `salary` in `emps`.
EXPLAIN LOGICAL SELECT avg(salary)
FROM emps INNER JOIN depts ON emps.deptno = depts.deptno;
+----------------------------------------------------------------------------------------+
| Explain String |
+----------------------------------------------------------------------------------------+
| - Output => [7:avg] |
| - AGGREGATE(GLOBAL) [] |
| Estimates: {row: 1, cpu: ?, memory: ?, network: ?, cost: 16.5} |
| 7:avg := avg(7:avg) |
| - EXCHANGE(GATHER) |
| Estimates: {row: 1, cpu: ?, memory: ?, network: ?, cost: 14.0} |
| - AGGREGATE(LOCAL) [] |
| Estimates: {row: 1, cpu: ?, memory: ?, network: ?, cost: 12.0} |
| 7:avg := avg(4:salary) |
| - SCAN [emps] => [4:salary] |
| Estimates: {row: 10, cpu: ?, memory: ?, network: ?, cost: 5.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
+----------------------------------------------------------------------------------------+
-- Q3: The predicate `name="R&D"` affects the cardinality of final results.
EXPLAIN LOGICAL SELECT emps.deptno, avg(salary) AS mean_salary
FROM emps INNER JOIN
(SELECT deptno FROM depts WHERE name="R&D") t ON emps.deptno = t.deptno
GROUP BY emps.deptno
ORDER BY mean_salary DESC
LIMIT 5;
+-------------------------------------------------------------------------------------------------------+
| Explain String |
+-------------------------------------------------------------------------------------------------------+
| - Output => [2:deptno, 7:avg] |
| - TOP-5(FINAL)[7: avg DESC NULLS LAST] |
| Estimates: {row: 2, cpu: ?, memory: ?, network: ?, cost: 165.0480769230769} |
| - TOP-5(PARTIAL)[7: avg DESC NULLS LAST] |
| Estimates: {row: 2, cpu: ?, memory: ?, network: ?, cost: 145.0480769230769} |
| - AGGREGATE(GLOBAL) [2:deptno] |
| Estimates: {row: 2, cpu: ?, memory: ?, network: ?, cost: 125.04807692307692} |
| 7:avg := avg(4:salary) |
| - HASH/INNER JOIN [2:deptno = 5:deptno] => [2:deptno, 4:salary] |
| Estimates: {row: 10, cpu: ?, memory: ?, network: ?, cost: 100.04807692307692} |
| - EXCHANGE(SHUFFLE) [2] |
| Estimates: {row: 10, cpu: ?, memory: ?, network: ?, cost: 50.0} |
| - SCAN [emps] => [2:deptno, 4:salary] |
| Estimates: {row: 10, cpu: ?, memory: ?, network: ?, cost: 10.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
| - EXCHANGE(SHUFFLE) [5] |
| Estimates: {row: 5, cpu: ?, memory: ?, network: ?, cost: 15.0} |
| - SCAN [depts] => [5:deptno] |
| Estimates: {row: 5, cpu: ?, memory: ?, network: ?, cost: 5.0} |
| partitionRatio: 1/1, tabletRatio: 10/10 |
| predicate: 6:name = 'R&D' |
+-------------------------------------------------------------------------------------------------------+
在上面的示例中,如执行计划中所示,Q1 和 Q2 允许表裁剪,而其谓词影响最终结果基数的 Q3 无法执行表裁剪。