Version: Latest-3.5

HLL (HyperLogLog)

HLL is used for approximate count distinct.

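HLL enables the development of programs based on the HyperLogLog algorithm. It stores the intermediate results of the HyperLogLog calculation and can be used only as the type of a value column in a table. HLL reduces the data volume through aggregation, which speeds up queries. The estimated result may deviate from the exact count by about 1%.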

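An HLL column is generated from the loaded data or from the data of other columns. When data is loaded, the hll_hash function specifies which column is used to generate the HLL column. HLL is typically used to replace COUNT DISTINCT and, combined with rollups, to calculate UV (unique visitors) quickly.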
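To make the COUNT DISTINCT replacement concrete, here is a minimal sketch that contrasts the two query styles. It assumes the test table defined in the examples later on this page, where set1 is loaded with hll_hash(id); the column aliases are illustrative only.

```sql
-- Exact distinct count: deduplicates the raw id values at query time.
SELECT dt, COUNT(DISTINCT id) AS uv_exact
FROM test
GROUP BY dt;

-- Approximate distinct count: merges the HLL values already stored in set1.
SELECT dt, HLL_UNION_AGG(set1) AS uv_approx
FROM test
GROUP BY dt;
```

The second query returns an estimate rather than an exact count, so the two results can differ slightly.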

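The storage space used by an HLL value is determined by the number of distinct hash values it holds. The storage cost falls into one of three cases:

The HLL is empty. No value has been inserted into it, and the storage cost is the lowest, at 80 bytes.
The number of distinct hash values in the HLL is less than or equal to 160. The highest storage cost is 1,360 bytes (80 + 160 * 8 = 1360).
The number of distinct hash values in the HLL is greater than 160. The storage cost is fixed at 16,464 bytes (80 + 16 * 1024 = 16464).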
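The byte counts in the three cases follow from the formulas quoted in parentheses. As a quick check, the query below reproduces that arithmetic using only integer expressions. Reading the 80 as a fixed base cost and the 8 as a per-hash-value cost is an interpretation of those formulas, not something the page states separately.

```sql
-- Upper bound when at most 160 distinct hash values are stored:
-- 80-byte base plus 160 entries of 8 bytes each.
-- Fixed cost beyond 160 distinct hash values:
-- 80-byte base plus 16 * 1024 bytes.
SELECT 80 + 160 * 8   AS max_small_hll_bytes,  -- 1360
       80 + 16 * 1024 AS full_hll_bytes;       -- 16464
```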

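In real business scenarios, data volume and data distribution affect both the memory usage of queries and the accuracy of the approximate result. Consider the following factors:

Data volume: HLL returns an approximate value. A larger data volume yields a more accurate result, and a smaller data volume leads to a larger deviation.
Data distribution: When the data volume is large and GROUP BY is performed on a high-cardinality column, the computation uses more memory, and HLL is not recommended. HLL is recommended for count distinct without GROUP BY, or with GROUP BY on a low-cardinality column.
Query granularity: If you query data at a coarse granularity, we recommend that you use an aggregate table or a materialized view to pre-aggregate the data and reduce the data volume.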
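The pre-aggregation advice in the last item can be sketched with a synchronous materialized view. This is an illustration under stated assumptions, not a statement from this page: the table page_visits, its columns, and the view name uv_by_dt are hypothetical, and the sketch assumes that your StarRocks version accepts hll_union(hll_hash(...)) in a synchronous materialized view definition.

```sql
-- Hypothetical detail table holding one row per page visit.
CREATE TABLE page_visits (
    dt DATE,
    user_id INT,
    url VARCHAR(255)
)
DUPLICATE KEY(dt, user_id)
DISTRIBUTED BY HASH(user_id);

-- Pre-aggregate one HLL value per day (assumed syntax, see the note above).
CREATE MATERIALIZED VIEW uv_by_dt AS
SELECT dt, hll_union(hll_hash(user_id))
FROM page_visits
GROUP BY dt;

-- A daily approximate count that may be served from the pre-aggregated data.
SELECT dt, approx_count_distinct(user_id) AS uv
FROM page_visits
GROUP BY dt;
```

The aggregate-table alternative is shown in the examples later on this page, where test_uv stores an HLL column populated from test.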

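Related functions

HLL_UNION_AGG(hll): An aggregate function that estimates the cardinality of all data that meets the conditions. It can also be used as an analytic (window) function, in which case it supports only the default window and does not support window clauses.
HLL_RAW_AGG(hll): An aggregate function that aggregates fields of the hll type and returns a value of the hll type.
HLL_CARDINALITY(hll): Estimates the cardinality of a single hll column.
HLL_HASH(column_name): Generates a value of the HLL type for use in INSERT or data loading. See the loading instructions for usage.
HLL_EMPTY: Generates an empty HLL value, which is used to fill in default values during INSERT or data loading. See the loading instructions for usage.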
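The functions listed above can be tried together on a small table. The following is a minimal sketch rather than an official example: the table hll_demo, its columns, and the sample values are hypothetical, and it assumes that HLL_HASH and HLL_EMPTY() can be used directly in an INSERT ... VALUES statement.

```sql
-- Hypothetical aggregate table with one HLL value column.
CREATE TABLE hll_demo (
    dt DATE,
    visitors HLL HLL_UNION
)
AGGREGATE KEY(dt)
DISTRIBUTED BY HASH(dt);

-- HLL_HASH builds an HLL value from a plain value; HLL_EMPTY() supplies an empty one.
INSERT INTO hll_demo VALUES
    ('2024-01-01', HLL_HASH(1)),
    ('2024-01-01', HLL_HASH(2)),
    ('2024-01-01', HLL_HASH(2)),
    ('2024-01-02', HLL_EMPTY());

-- Estimated cardinality of the HLL value stored in each row.
SELECT dt, HLL_CARDINALITY(visitors) FROM hll_demo;

-- Estimated cardinality across all rows.
SELECT HLL_UNION_AGG(visitors) FROM hll_demo;

-- Same estimate in two steps: merge into one HLL value, then estimate it.
SELECT HLL_CARDINALITY(merged)
FROM (SELECT HLL_RAW_AGG(visitors) AS merged FROM hll_demo) t;
```

Rows that share the same dt are merged by the HLL_UNION aggregation, so the duplicate value 2 should not be counted twice.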

Examples

Create a table that contains the HLL columns set1 and set2.

create table test(
dt date,
id int,
name char(10),
province char(10),
os char(1),
set1 hll hll_union,
set2 hll hll_union)
distributed by hash(id);

Load data by using Stream Load.

a. Use table columns to generate an HLL column.
curl --location-trusted -uname:password -T data -H "label:load_1" \
    -H "columns:dt, id, name, province, os, set1=hll_hash(id), set2=hll_hash(name)" \
    http://host/api/test_db/test/_stream_load

b. Use data columns to generate an HLL column.
curl --location-trusted -uname:password -T data -H "label:load_1" \
    -H "columns:dt, id, name, province, sex, cuid, os, set1=hll_hash(cuid), set2=hll_hash(os)" \
    http://host/api/test_db/test/_stream_load

Aggregate data in one of the following three ways. Without aggregation, querying the base table directly may be as slow as using approx_count_distinct.

-- a. Create a rollup to aggregate HLL column.
alter table test add rollup test_rollup(dt, set1);

-- b. Create another table to calculate uv and insert data into it

create table test_uv(
dt date,
id int,
uv_set hll hll_union)
distributed by hash(id);

insert into test_uv select dt, id, set1 from test;

-- c. Create another table to calculate UV. Insert data and generate the HLL column by applying hll_hash to other columns.

create table test_uv(
dt date,
id int,
id_set hll hll_union)
distributed by hash(id);

insert into test_uv select dt, id, hll_hash(id) from test;

Query data. The raw value of an HLL column cannot be queried directly. Query it through the corresponding functions.

a. Calculate the total UV.
select HLL_UNION_AGG(uv_set) from test_uv;

b. Calculate the UV for each day.
select dt, HLL_UNION_AGG(uv_set) as uv from test_uv group by dt;

c. Calculate the aggregation value of set1 in the test table.
select dt, HLL_CARDINALITY(uv) from (select dt, HLL_RAW_AGG(set1) as uv from test group by dt) tmp;
select dt, HLL_UNION_AGG(set1) as uv from test group by dt;
