版本: 最新版本-3.5

使用自增列和全局字典加速 COUNT(DISTINCT) 和 Join

本文介绍如何使用自增 ID 列和全局字典来加速 COUNT(DISTINCT) 计算和 Join 操作。

使用场景

场景一：假设您需要对海量数据（例如零售或送货订单）执行精确去重。但是，用于去重的列是 STRING 类型，这可能会导致计数期间的性能不佳。例如，在 orders 表中，表示订单 ID 的 order_uuid 列是 STRING 类型，通常大小为 32 到 36 字节，由 UUID() 或类似函数生成。在这种情况下，对 STRING 列 order_uuid 执行 COUNT(DISTINCT) 操作（如 SELECT count(DISTINCT order_uuid) FROM orders WHERE create_date >= CURDATE();）可能无法提供令人满意的性能。使用 INTEGER 列进行精确去重将显着提高性能。
场景二：假设您想使用 bitmap 函数加速多维分析中的精确去重。bitmap_count() 函数需要 INTEGER 输入，但如果要进行去重的列是 STRING 类型，则需要使用 bitmap_hash() 函数。这可能会导致近似的、略低的去重计数，并且还可能降低查询性能并增加存储需求，因为由 bitmap_hash() 生成的 INTEGER 值与顺序分配的 INTEGER 值相比更分散。
场景三：假设您需要查询下单和付款之间时间间隔短的订单数量，其中下单时间和付款时间可能存储在不同业务团队维护的不同表中。您可能需要基于订单 ID 连接这些表，然后对订单进行去重。例如
```
SELECT count(distinct order_uuid)
FROM orders_t1 as t1 JOIN orders_t2 as t2
    ON t1.order_uuid = t2.order_uuid
WHERE t2.payment_time - t1.create_time <= 3600
    AND create_date >= CURDATE();
```
但是，使用 STRING 类型的 order_uuid 列进行连接效率低于使用 INTEGER 列。

优化方法

为了解决上述场景中的问题，优化方法包括将订单数据加载到目标表中，并建立 STRING 和 INTEGER 值之间的映射。随后的查询分析将基于 INTEGER 列。此方法可以分为以下几个阶段

第一阶段：创建全局字典并建立 STRING 和 INTEGER 值之间的映射。在此字典中，键列为 STRING 类型，值列为 AUTO INCREMENT INTEGER 类型。加载数据时，系统会自动为每个 STRING 值生成唯一的 ID，从而创建 STRING 和 INTEGER 值之间的映射。
第二阶段：将订单数据与全局字典之间的映射关系加载到目标表中。
第三阶段：在随后的查询分析中使用目标表中的 INTEGER 列进行精确去重或连接，这可以显着提高性能。
第四阶段：为了进一步优化性能，您可以在 INTEGER 列上使用 bitmap 函数来加速精确去重。

解决方案

在 v3.2.5 之前，第二阶段可以通过两种方法实现

使用外部表或内部表作为中间表与字典表进行连接，以在加载之前获取相应的字典 ID。
使用 Primary Key 表进行数据加载，然后使用带有 JOIN 操作的 UPDATE 语句来更新字典 ID。但是，这种数据加载过程可能不方便并且有很多约束。

从 v3.2.5 开始，StarRocks 引入了 dict_mapping() 函数，允许您使用 dict_mapping() 表达式将目标表中的字典 ID 列定义为生成列。随后的数据加载任务像常规数据加载一样处理，无需使用带有 JOIN 操作的 UPDATE 语句来写入字典 ID。在数据加载期间，系统会自动将原始表与字典表关联起来，并插入相应的字典 ID，大大简化了使用全局字典表的数据加载过程，无论表类型如何，并支持各种加载方法。

业务场景

以下示例使用两个示例 CSV 文件，batch1.csv 和 batch2.csv，每个文件包含两列：id 和 order_uuid。

batch1.csv
```
1, a1
2, a2
3, a3
11, a1
11, a2
12, a1
```

batch2.csv

1, a2
2, a2
3, a2
11, a2
12, a101
12, a102
13, a102

流程

第一阶段

创建一个全局字典表，并将 CSV 文件中的订单 ID 列值加载到该表中，以建立 STRING 和 INTEGER 值之间的映射。

创建一个 Primary Key 表作为全局字典。定义主键 order_uuid（STRING 类型），以及值列 order_id_int（AUTO INCREMENT INTEGER 类型）。

信息

dict_mapping 函数要求全局字典表为 Primary Key 表。

CREATE TABLE dict (
    order_uuid STRING,
    order_id_int BIGINT AUTO_INCREMENT  -- Automatically assign an ID to each order_uuid value.
)
PRIMARY KEY (order_uuid)
DISTRIBUTED BY HASH (order_uuid)
PROPERTIES("replicated_storage" = "true");

使用 Stream Load 将两个 CSV 文件中的 order_uuid 列批量加载到字典表 dict 的 order_uuid 列中。确保您已在列模式下使用了 Partial Update。

curl --location-trusted -u root: \
    -H "partial_update: true" \
    -H "format: CSV" -H "column_separator:," -H "columns: id, order_uuid" \
    -T batch1.csv \
    -XPUT http://<fe_host>:<fe_http_port>/api/example_db/dict/_stream_load
    
curl --location-trusted -u root: \
    -H "partial_update: true" \
    -H "format: CSV" -H "column_separator:," -H "columns: id, order_uuid" \
    -T batch2.csv \
    -XPUT http://<fe_host>:<fe_http_port>/api/example_db/dict/_stream_load

注意

如果在进行到下一阶段之前将新数据添加到数据源，则必须将所有新数据加载到字典表中，以确保映射存在。

第二阶段

创建一个包含带有 dict_mapping 属性的字典 ID 列的目标表。将订单数据加载到目标表中时，系统会自动将其与字典表关联起来，并插入相应的字典 ID。

创建一个表 dest_table，其中包含 CSV 文件中的所有列。您还需要定义一个 INTEGER 列 order_id_int（通常为 BIGINT），以映射 STRING 类型的 order_uuid 列，并具有 dict_mapping 列属性。将来的查询分析将基于此 order_id_int 列。

-- In the target table, define a BIGINT dict_mapping column `order_id_int` to map with the STRING-type column `order_uuid`.
CREATE TABLE dest_table (
    id BIGINT,
    order_uuid STRING, -- This column records STRING-type Order ID.
    batch INT comment 'Used to distinguish different batch loading',
    order_id_int BIGINT AS dict_mapping('dict', order_uuid) -- Dictionary ID dict_mapping column corresponds to `order_uuid`.
)
DUPLICATE KEY (id, order_uuid)
DISTRIBUTED BY HASH(id);

使用 Stream Load 或任何其他可用方法将数据加载到目标表中。由于 order_id_int 列具有 dict_mapping 属性，因此系统将在加载期间自动从 dict 中获取字典 ID。

curl --location-trusted -u root: \
    -H "format: CSV" -H "column_separator:," -H "columns: id, order_uuid, batch=1" \
    -T batch1.csv \
    -XPUT http://<fe_host>:<fe_http_port>/api/example_db/dest_table/_stream_load
    
curl --location-trusted -u root: \
    -H "format: CSV" -H "column_separator:," -H "columns: id, order_uuid, batch=2" \
    -T batch2.csv \
    -XPUT http://<fe_host>:<fe_http_port>/api/example_db/dest_table/_stream_load

第三阶段

在查询分析期间，您可以对 INTEGER 列 order_id_int 执行精确去重或连接，与使用 STRING 列 order_uuid 相比，可以显着提高性能。

-- Exact deduplication based on BIGINT-type order_id_int.
SELECT id, COUNT(DISTINCT order_id_int) FROM dest_table GROUP BY id ORDER BY id;
-- Exact deduplication based on STRING-type order_uuid .
SELECT id, COUNT(DISTINCT order_uuid) FROM dest_table GROUP BY id ORDER BY id;

您还可以使用 bitmap 函数来加速精确去重。

使用 bitmap 函数来加速精确去重

为了进一步加速计算，您可以在创建全局字典后，将字典表的 INTEGER 列值直接插入到 bitmap 列中。随后，您可以在此 bitmap 列上使用 bitmap 函数进行精确去重。

方法一

如果您已经构建了全局字典并将订单数据导入到 dest_table 中，请按照以下步骤操作

创建一个 Aggregate 表 dest_table_bitmap，其中包含两列：BITMAP 类型的 order_id_bitmap 列，用于使用 bitmap_union() 函数进行聚合，以及 INTEGER 类型的列 id。此表不包含原始 STRING 列，否则，每个 bitmap 将仅包含一个值，从而否定了加速效果。
```
CREATE TABLE dest_table_bitmap (
    id BIGINT,
    order_id_bitmap BITMAP BITMAP_UNION
)
AGGREGATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 6;
```

将数据插入到 dest_table_bitmap 中。将 dest_table 的 id 列中的数据插入到 id 列中，并将字典表 dict 中的 INTEGER 列 order_id_int 数据（使用 to_bitmap() 函数处理）插入到 order_id_bitmap 列中。

INSERT INTO dest_table_bitmap (id, order_id_bitmap)
SELECT id,  to_bitmap(dict_mapping('dict', order_uuid))
FROM dest_table
WHERE dest_table.batch = 1; -- Indicates different batches.
    
INSERT INTO dest_table_bitmap (id, order_id_bitmap)
SELECT id, to_bitmap(dict_mapping('dict', order_uuid))
FROM dest_table
WHERE dest_table.batch = 2;

在 BITMAP 列上使用 BITMAP_UNION_COUNT() 函数进行精确去重。

SELECT id, BITMAP_UNION_COUNT(order_id_bitmap) FROM dest_table_bitmap
GROUP BY id ORDER BY id;

方法二

如果您在创建全局字典后不需要保留特定的订单数据，并且想要将数据直接加载到 dest_table_bitmap 表中，请按照以下步骤操作

创建一个 Aggregate 表 dest_table_bitmap，其中包含两列：BITMAP 类型的 order_id_bitmap 列，用于使用 bitmap_union() 函数进行聚合，以及 INTEGER 类型的列 id。此表不包含原始 STRING 列，否则，每个 bitmap 将仅包含一个值，从而否定了加速效果。
```
CREATE TABLE dest_table_bitmap (
    id BIGINT,
    order_id_bitmap BITMAP BITMAP_UNION
)
AGGREGATE KEY (id)
DISTRIBUTED BY HASH(id) BUCKETS 6;
```

将数据插入到 Aggregate 表中。将 CSV 文件的 id 列中的数据插入到 id 列中，并将字典表 dict 中的 INTEGER 列 order_id_int 数据（使用 to_bitmap() 函数处理）插入到 order_id_bitmap 列中。

curl --location-trusted -u root: \
    -H "format: CSV" -H "column_separator:," \
    -H "columns: id, order_uuid,  order_id_bitmap=to_bitmap(dict_mapping('dict', order_uuid))" \
    -T batch1.csv \
    -XPUT http://<fe_host>:<fe_http_port>/api/example_db/dest_table_bitmap/_stream_load

curl --location-trusted -u root: \
    -H "format: CSV" -H "column_separator:," \
    -H "columns: id, order_uuid, order_id_bitmap=to_bitmap(dict_mapping('dict', order_uuid))" \
    -T batch2.csv \
    -XPUT http:///<fe_host>:<fe_http_port>/api/example_db/dest_table_bitmap/_stream_load

在 BITMAP 列上使用 BITMAP_UNION_COUNT() 函数进行精确去重。

SELECT id, BITMAP_UNION_COUNT(order_id_bitmap) FROM dest_table_bitmap
GROUP BY id ORDER BY id;

使用场景​

优化方法​

解决方案​

业务场景​

流程​

第一阶段​

第二阶段​

第三阶段​

使用 bitmap 函数来加速精确去重​

方法一​

方法二​

您觉得这篇文档怎么样？