Version: Latest-3.5

CREATE STORAGE VOLUME

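The CREATE STORAGE VOLUME statement creates a storage volume for a remote storage system. This feature is supported from v3.1 onwards.

A storage volume consists of the properties and credential information of the remote data storage. You can reference a storage volume when you create databases and cloud-native tables in a shared-data StarRocks cluster.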
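
As a rough illustration of how a storage volume is referenced, the sketch below assumes that a storage volume named `my_s3_volume` already exists and is enabled, and that databases and tables accept a `storage_volume` property. The database name `cloud_db` and the table `detail_demo` are hypothetical.

```sql
-- Assumption: my_s3_volume was created earlier and is enabled.
-- Create a database whose tables default to this storage volume.
CREATE DATABASE cloud_db
PROPERTIES ("storage_volume" = "my_s3_volume");

-- Create a cloud-native table that explicitly references the same volume.
CREATE TABLE cloud_db.detail_demo (
    id   BIGINT NOT NULL,
    name VARCHAR(64)
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id)
PROPERTIES ("storage_volume" = "my_s3_volume");
```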

Note

Only users with the CREATE STORAGE VOLUME privilege on the SYSTEM level can perform this operation.

Syntax

CREATE STORAGE VOLUME [IF NOT EXISTS] <storage_volume_name>
TYPE = { S3 | HDFS | AZBLOB | ADLS2 | GS }
LOCATIONS = ('<remote_storage_path>')
[ COMMENT '<comment_string>' ]
PROPERTIES
("key" = "value",...)

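Parameters

Parameter	Description
storage_volume_name	The name of the storage volume. Note that you cannot create a storage volume named `builtin_storage_volume`, because that name is reserved for the built-in storage volume. For naming conventions, see System limits.
TYPE	The type of the remote storage system. Valid values: `S3`, `HDFS`, `AZBLOB`, `ADLS2`, and `GS`. `S3` indicates AWS S3 or an S3-compatible storage system. `AZBLOB` indicates Azure Blob Storage (supported from v3.1.1 onwards). `ADLS2` indicates Azure Data Lake Storage Gen2 (supported from v3.4.1 onwards). `HDFS` indicates an HDFS cluster. `GS` indicates Google Storage accessed through the native SDK (supported from v3.5.1 onwards).
LOCATIONS	The storage location. The format depends on the storage system. For AWS S3 or S3-compatible storage systems: `s3://<s3_path>`. `<s3_path>` must be an absolute path, for example, `s3://testbucket/subpath`. Note that if you want to enable the Partitioned Prefix feature for the storage volume, you can specify only the bucket name; a sub-path is not allowed. For Azure Blob Storage: `azblob://<azblob_path>`. `<azblob_path>` must be an absolute path, for example, `azblob://testcontainer/subpath`. For Azure Data Lake Storage Gen2: `adls2://<file_system_name>/<dir_name>`, for example, `adls2://testfilesystem/starrocks`. For GS with the native SDK: `gs://<gs_path>`. `<gs_path>` must be an absolute path, for example, `gs://testbucket/subpath`. For HDFS: `hdfs://<host>:<port>/<hdfs_path>`. `<hdfs_path>` must be an absolute path, for example, `hdfs://127.0.0.1:9000/user/xxx/starrocks`. For WebHDFS: `webhdfs://<host>:<http_port>/<hdfs_path>`, where `<http_port>` is the HTTP port of the NameNode. `<hdfs_path>` must be an absolute path, for example, `webhdfs://127.0.0.1:50070/user/xxx/starrocks`. For ViewFS: `viewfs://<ViewFS_cluster>/<viewfs_path>`, where `<ViewFS_cluster>` is the ViewFS cluster name. `<viewfs_path>` must be an absolute path, for example, `viewfs://myviewfscluster/user/xxx/starrocks`.
COMMENT	The comment on the storage volume.
PROPERTIES	Parameters in `"key" = "value"` pairs that specify the properties and credential information for accessing the remote storage system. For details, see PROPERTIES.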
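
The optional clauses in the syntax can be combined as in the following minimal sketch. The volume name `demo_volume`, the comment text, and the bucket path are placeholders chosen for illustration, and the statement assumes that the AWS SDK default credential chain is available on the cluster nodes.

```sql
-- IF NOT EXISTS makes the statement a no-op when demo_volume already exists.
CREATE STORAGE VOLUME IF NOT EXISTS demo_volume
TYPE = S3
LOCATIONS = ("s3://testbucket/subpath")
COMMENT 'Storage volume for the demo workload'
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "true"
);
```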

PROPERTIES

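The following table lists all available properties of storage volumes. After the table are usage instructions for these properties, grouped by scenario from the perspectives of credential information and features.

Property	Description
enabled	Whether to enable this storage volume. Default: `false`. A disabled storage volume cannot be referenced.
aws.s3.region	The region in which your S3 bucket resides, for example, `us-west-2`.
aws.s3.endpoint	The endpoint URL used to access your S3 bucket, for example, `https://s3.us-west-2.amazonaws.com`. [Preview] From v3.3.0 onwards, the Amazon S3 Express One Zone storage class is supported, for example, `https://s3express.us-west-2.amazonaws.com`.
aws.s3.use_aws_sdk_default_behavior	Whether to use the default authentication credential of the AWS SDK. Valid values: `true` and `false` (default).
aws.s3.use_instance_profile	Whether to use Instance Profile and Assumed Role as the credential methods for accessing S3. Valid values: `true` and `false` (default). If you use IAM user-based credentials (access key and secret key) to access S3, you must set this item to `false` and specify `aws.s3.access_key` and `aws.s3.secret_key`. If you use Instance Profile to access S3, you must set this item to `true`. If you use Assumed Role to access S3, you must set this item to `true` and specify `aws.s3.iam_role_arn`. If you use an external AWS account, you must set this item to `true` and specify `aws.s3.iam_role_arn` and `aws.s3.external_id`.
aws.s3.access_key	The access key ID used to access your S3 bucket.
aws.s3.secret_key	The secret access key used to access your S3 bucket.
aws.s3.iam_role_arn	The ARN of the IAM role that has privileges on the S3 bucket in which your data files are stored.
aws.s3.external_id	The external ID of the AWS account that is used for cross-account access to your S3 bucket.
azure.blob.endpoint	The endpoint of your Azure Blob Storage account, for example, `https://test.blob.core.windows.net`.
azure.blob.shared_key	The shared key used to authorize requests for your Azure Blob Storage.
azure.blob.sas_token	The shared access signature (SAS) used to authorize requests for your Azure Blob Storage.
azure.adls2.endpoint	The endpoint of your Azure Data Lake Storage Gen2 account, for example, `https://test.dfs.core.windows.net`.
azure.adls2.shared_key	The shared key used to authorize requests for your Azure Data Lake Storage Gen2.
azure.adls2.sas_token	The shared access signature (SAS) used to authorize requests for your Azure Data Lake Storage Gen2.
azure.adls2.oauth2_use_managed_identity	Whether to use a managed identity to authorize requests for your Azure Data Lake Storage Gen2. Default: `false`.
azure.adls2.oauth2_tenant_id	The tenant ID of the managed identity used to authorize requests for your Azure Data Lake Storage Gen2.
azure.adls2.oauth2_client_id	The client ID of the managed identity used to authorize requests for your Azure Data Lake Storage Gen2.
gcp.gcs.service_account_email	The email address in the JSON file generated when the service account was created, for example, `user@hello.iam.gserviceaccount.com`.
gcp.gcs.service_account_private_key_id	The private key ID in the JSON file generated when the service account was created.
gcp.gcs.service_account_private_key	The private key in the JSON file generated when the service account was created, for example, `-----BEGIN PRIVATE KEY----xxxx-----END PRIVATE KEY-----\n`.
gcp.gcs.impersonation_service_account	The service account to impersonate if you use impersonation-based authentication.
gcp.gcs.use_compute_engine_service_account	Whether to use the service account that is bound to your Compute Engine.
hadoop.security.authentication	The authentication method. Valid values: `simple` (default) and `kerberos`. `simple` indicates simple authentication, that is, by username. `kerberos` indicates Kerberos authentication.
username	The username used to access the NameNode in the HDFS cluster.
hadoop.security.kerberos.ticket.cache.path	The path that stores the ticket cache generated by kinit.
dfs.nameservices	The name of the HDFS cluster.
dfs.ha.namenodes.`<ha_cluster_name>`	The names of the NameNodes. Separate multiple names with commas (,). No space is allowed inside the double quotes. `<ha_cluster_name>` is the name of the HDFS service specified in `dfs.nameservices`.
dfs.namenode.rpc-address.`<ha_cluster_name>`.`<NameNode>`	The RPC address information of the NameNode. `<NameNode>` is the name of a NameNode specified in `dfs.ha.namenodes.<ha_cluster_name>`.
dfs.client.failover.proxy.provider	The provider of the NameNode for client connections. The default value is `org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider`.
fs.viewfs.mounttable.`<ViewFS_cluster>`.link./`<viewfs_path>`	The path to the ViewFS cluster to be mounted. Separate multiple paths with commas (,). `<ViewFS_cluster>` is the ViewFS cluster name specified in `LOCATIONS`.
aws.s3.enable_partitioned_prefix	Whether to enable the Partitioned Prefix feature for the storage volume. Default: `false`. For more information about this feature, see Partitioned Prefix.
aws.s3.num_partitioned_prefix	The number of prefixes to be created for the storage volume. Default: `256`. Valid range: [4, 1024].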

Credential information

AWS S3

If you use the default authentication credential of the AWS SDK to access S3, set the following properties:

"enabled" = "{ true | false }",
"aws.s3.region" = "<region>",
"aws.s3.endpoint" = "<endpoint_url>",
"aws.s3.use_aws_sdk_default_behavior" = "true"

If you use IAM user-based credentials (access key and secret key) to access S3, set the following properties:

"enabled" = "{ true | false }",
"aws.s3.region" = "<region>",
"aws.s3.endpoint" = "<endpoint_url>",
"aws.s3.use_aws_sdk_default_behavior" = "false",
"aws.s3.use_instance_profile" = "false",
"aws.s3.access_key" = "<access_key>",
"aws.s3.secret_key" = "<secret_key>"

If you use Instance Profile to access S3, set the following properties:

"enabled" = "{ true | false }",
"aws.s3.region" = "<region>",
"aws.s3.endpoint" = "<endpoint_url>",
"aws.s3.use_aws_sdk_default_behavior" = "false",
"aws.s3.use_instance_profile" = "true"
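
Put into a complete statement, the Instance Profile properties above might look like the following sketch. The volume name `s3_instance_profile_volume` and the bucket path are hypothetical, and the statement assumes that the cluster nodes run with an instance profile that can read and write the bucket.

```sql
-- No access key or secret key is supplied; credentials come from the
-- instance profile attached to the nodes (an assumption of this sketch).
CREATE STORAGE VOLUME s3_instance_profile_volume
TYPE = S3
LOCATIONS = ("s3://testbucket/subpath")
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "true"
);
```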

If you use Assumed Role to access S3, set the following properties:

"enabled" = "{ true | false }",
"aws.s3.region" = "<region>",
"aws.s3.endpoint" = "<endpoint_url>",
"aws.s3.use_aws_sdk_default_behavior" = "false",
"aws.s3.use_instance_profile" = "true",
"aws.s3.iam_role_arn" = "<role_arn>"

If you use Assumed Role from an external AWS account to access S3, set the following properties:

"enabled" = "{ true | false }",
"aws.s3.region" = "<region>",
"aws.s3.endpoint" = "<endpoint_url>",
"aws.s3.use_aws_sdk_default_behavior" = "false",
"aws.s3.use_instance_profile" = "true",
"aws.s3.iam_role_arn" = "<role_arn>",
"aws.s3.external_id" = "<external_id>"
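
A cross-account setup combines the role ARN and the external ID in one statement, as in this sketch. The volume name, the account ID `123456789012`, the role name `starrocks_demo_role`, and the external ID value are all made up for illustration; substitute the values issued by the owner of the external account.

```sql
-- Hypothetical role ARN and external ID; replace both with real values.
CREATE STORAGE VOLUME s3_cross_account_volume
TYPE = S3
LOCATIONS = ("s3://testbucket/subpath")
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "true",
    "aws.s3.iam_role_arn" = "arn:aws:iam::123456789012:role/starrocks_demo_role",
    "aws.s3.external_id" = "demo-external-id"
);
```

Dropping the `aws.s3.external_id` line gives the plain Assumed Role variant described earlier.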

MinIO

If you use MinIO, set the following properties:

"enabled" = "{ true | false }",

-- For example: us-east-1
"aws.s3.region" = "<region>",

-- For example: http://172.26.xx.xxx:39000
"aws.s3.endpoint" = "<endpoint_url>",

"aws.s3.access_key" = "<access_key>",
"aws.s3.secret_key" = "<secret_key>"
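
For MinIO the volume type stays `S3`, since MinIO is an S3-compatible system. The following sketch assumes a MinIO server reachable at the hypothetical address `http://minio.example.com:9000` with a bucket named `testbucket`; the volume name and the key values are placeholders.

```sql
-- TYPE is S3 because MinIO speaks the S3 protocol.
CREATE STORAGE VOLUME minio_volume
TYPE = S3
LOCATIONS = ("s3://testbucket/subpath")
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-east-1",
    "aws.s3.endpoint" = "http://minio.example.com:9000",
    "aws.s3.access_key" = "xxxxxxxxxx",
    "aws.s3.secret_key" = "yyyyyyyyyy"
);
```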

Azure Blob Storage

Creating storage volumes on Azure Blob Storage is supported from v3.1.1 onwards.

If you use a shared key to access Azure Blob Storage, set the following properties:

"enabled" = "{ true | false }",
"azure.blob.endpoint" = "<endpoint_url>",
"azure.blob.shared_key" = "<shared_key>"

If you use a shared access signature (SAS) to access Azure Blob Storage, set the following properties:

"enabled" = "{ true | false }",
"azure.blob.endpoint" = "<endpoint_url>",
"azure.blob.sas_token" = "<sas_token>"

Note

The hierarchical namespace must be disabled when you create the Azure Blob Storage account.
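
A complete statement for Azure Blob Storage with a shared key could look like the sketch below. The volume name `azblob_volume` is hypothetical; the container path and the endpoint reuse the sample values from the parameter tables, and the key value is a placeholder.

```sql
-- Assumes the storage account was created with the hierarchical namespace disabled.
CREATE STORAGE VOLUME azblob_volume
TYPE = AZBLOB
LOCATIONS = ("azblob://testcontainer/subpath")
PROPERTIES
(
    "enabled" = "true",
    "azure.blob.endpoint" = "https://test.blob.core.windows.net",
    "azure.blob.shared_key" = "xxxxxxxxxx"
);
```

To authenticate with a SAS instead, replace the `azure.blob.shared_key` line with `"azure.blob.sas_token" = "<sas_token>"`.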

Azure Data Lake Storage Gen2

Creating storage volumes on Azure Data Lake Storage Gen2 is supported from v3.4.1 onwards.

If you use a shared key to access Azure Data Lake Storage Gen2, set the following properties:

"enabled" = "{ true | false }",
"azure.adls2.endpoint" = "<endpoint_url>",
"azure.adls2.shared_key" = "<shared_key>"

If you use a shared access signature (SAS) to access Azure Data Lake Storage Gen2, set the following properties:

"enabled" = "{ true | false }",
"azure.adls2.endpoint" = "<endpoint_url>",
"azure.adls2.sas_token" = "<sas_token>"

If you use a managed identity to access Azure Data Lake Storage Gen2, set the following properties:

"enabled" = "{ true | false }",
"azure.adls2.endpoint" = "<endpoint_url>",
"azure.adls2.oauth2_use_managed_identity" = "true",
"azure.adls2.oauth2_tenant_id" = "<tenant_id>",
"azure.adls2.oauth2_client_id" = "<client_id>" 

Note

Azure Data Lake Storage Gen1 is not supported.
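
The managed identity properties can be assembled into a statement along these lines. The volume name `adls2_mi_volume` is hypothetical, and the tenant and client IDs are dummy all-zero GUID-style placeholders, not real identifiers.

```sql
-- Placeholder tenant and client IDs; use the values of your managed identity.
CREATE STORAGE VOLUME adls2_mi_volume
TYPE = ADLS2
LOCATIONS = ("adls2://testfilesystem/starrocks")
PROPERTIES
(
    "enabled" = "true",
    "azure.adls2.endpoint" = "https://test.dfs.core.windows.net",
    "azure.adls2.oauth2_use_managed_identity" = "true",
    "azure.adls2.oauth2_tenant_id" = "00000000-0000-0000-0000-000000000000",
    "azure.adls2.oauth2_client_id" = "00000000-0000-0000-0000-000000000001"
);
```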

Google Storage

If you use the service account that is bound to your Compute Engine to access Google Storage (supported from v3.5.1 onwards), set the following properties:
```
"enabled" = "{ true | false }",
"gcp.gcs.use_compute_engine_service_account" = "true"
```
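
As a sketch of the Compute Engine case, the statement below creates a `GS` volume without any key material. The volume name `gs_ce_volume` and the bucket path are placeholders, and the statement assumes that the cluster runs on Compute Engine instances whose bound service account can access the bucket.

```sql
-- Credentials come from the service account bound to the Compute Engine instances.
CREATE STORAGE VOLUME gs_ce_volume
TYPE = GS
LOCATIONS = ("gs://testbucket/subpath")
PROPERTIES
(
    "enabled" = "true",
    "gcp.gcs.use_compute_engine_service_account" = "true"
);
```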

If you use service account-based authentication to access Google Storage (supported from v3.5.1 onwards), set the following properties:

"enabled" = "{ true | false }",
"gcp.gcs.use_compute_engine_service_account" = "false",
"gcp.gcs.service_account_email" = "<google_service_account_email>",
"gcp.gcs.service_account_private_key_id" = "<google_service_private_key_id>",
"gcp.gcs.service_account_private_key" = "<google_service_private_key>"

If you use impersonation-based authentication to access Google Storage (supported from v3.5.1 onwards), set the following properties:

"enabled" = "{ true | false }",
"gcp.gcs.use_compute_engine_service_account" = "false",
"gcp.gcs.service_account_email" = "<google_service_account_email>",
"gcp.gcs.service_account_private_key_id" = "<google_service_private_key_id>",
"gcp.gcs.service_account_private_key" = "<google_service_private_key>",
"gcp.gcs.impersonation_service_account" = "<assumed_google_service_account_email>"

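If you use the S3 protocol and IAM user-based authentication to access Google Storage, set the following properties:

Tip
Google Storage supports the XML API, and the setup uses AWS S3 syntax. In this case, you must set TYPE to S3 and set LOCATIONS to an S3-compatible storage location.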
```
"enabled" = "{ true | false }",

-- For example: us-east1
"aws.s3.region" = "<region>",

-- For example: https://storage.googleapis.com
"aws.s3.endpoint" = "<endpoint_url>",

"aws.s3.access_key" = "<access_key>",
"aws.s3.secret_key" = "<secret_key>"
```
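
When Google Storage is accessed through the S3 protocol, the whole statement is written as for any S3-compatible system. In the sketch below, the volume name `gs_s3_volume` and the bucket path are hypothetical, and the access key and secret key are assumed to be an HMAC key pair issued for Google Storage interoperability.

```sql
-- TYPE is S3 and LOCATIONS uses the s3:// scheme, even though the bucket lives in Google Storage.
CREATE STORAGE VOLUME gs_s3_volume
TYPE = S3
LOCATIONS = ("s3://testbucket/subpath")
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-east1",
    "aws.s3.endpoint" = "https://storage.googleapis.com",
    "aws.s3.access_key" = "xxxxxxxxxx",
    "aws.s3.secret_key" = "yyyyyyyyyy"
);
```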

HDFS

If you access HDFS without authentication, set the following properties:
```
"enabled" = "{ true | false }"
```

If you use simple authentication (supported from v3.2 onwards) to access HDFS, set the following properties:

"enabled" = "{ true | false }",
"hadoop.security.authentication" = "simple",
"username" = "<hdfs_username>"

If you use Kerberos ticket cache authentication (supported from v3.2 onwards) to access HDFS, set the following properties:
```
"enabled" = "{ true | false }",
"hadoop.security.authentication" = "kerberos",
"hadoop.security.kerberos.ticket.cache.path" = "<ticket_cache_path>"
```
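Note
- This setting only forces the system to use KeyTab to access HDFS via Kerberos. Make sure that each BE or CN node can access the KeyTab files, and that the /etc/krb5.conf file is set up correctly.
- The ticket cache is generated by an external kinit tool. Make sure you have a crontab entry or a similar periodic task to refresh the tickets.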

If your HDFS cluster has NameNode HA enabled (supported from v3.2 onwards), additionally set the following properties:

"dfs.nameservices" = "<ha_cluster_name>",
"dfs.ha.namenodes.<ha_cluster_name>" = "<NameNode1>,<NameNode2> [, ...]",
"dfs.namenode.rpc-address.<ha_cluster_name>.<NameNode1>" = "<hdfs_host>:<hdfs_port>",
"dfs.namenode.rpc-address.<ha_cluster_name>.<NameNode2>" = "<hdfs_host>:<hdfs_port>",
[...]
"dfs.client.failover.proxy.provider.<ha_cluster_name>" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"

For more information, see the HDFS HA documentation.

If you use WebHDFS (supported from v3.2 onwards), set the following properties:

"enabled" = "{ true | false }"

For more information, see the WebHDFS documentation.

If you use Hadoop ViewFS (supported from v3.2 onwards), set the following properties:

-- Replace <ViewFS_cluster> with the name of the ViewFS cluster.
"fs.viewfs.mounttable.<ViewFS_cluster>.link./<viewfs_path_1>" = "hdfs://<hdfs_host_1>:<hdfs_port_1>/<hdfs_path_1>",
"fs.viewfs.mounttable.<ViewFS_cluster>.link./<viewfs_path_2>" = "hdfs://<hdfs_host_2>:<hdfs_port_2>/<hdfs_path_2>",
[, ...]

For more information, see the ViewFS documentation.

Features

Partitioned Prefix

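From v3.2.4 onwards, StarRocks supports creating storage volumes with the Partitioned Prefix feature for S3-compatible object storage systems. When this feature is enabled, StarRocks stores data in multiple, uniformly prefixed partitions (sub-paths) under the bucket. Because the QPS or throughput limit of a bucket is applied per partition, this can multiply the read and write performance of StarRocks on the data files stored in the bucket.

To enable this feature, set the following properties in addition to the credential-related properties described above: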

"aws.s3.enable_partitioned_prefix" = "{ true | false }",
"aws.s3.num_partitioned_prefix" = "<INT>"

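Note

- The Partitioned Prefix feature is supported only for S3-compatible object storage systems, that is, the TYPE of the storage volume must be S3.
- The LOCATIONS of the storage volume must contain only the bucket name, for example, s3://testbucket. Specifying a sub-path after the bucket name is not allowed.
- Both properties are immutable once the storage volume is created.
- This feature cannot be enabled when the storage volume is created through the FE configuration file fe.conf.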
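
Taking the constraints above together, a storage volume with Partitioned Prefix enabled might be created as in this sketch. The volume name `s3_prefixed_volume`, the bucket name, and the prefix count of `512` are illustrative choices; the credential properties follow the IAM user-based pattern, with placeholder keys.

```sql
-- LOCATIONS holds only the bucket name; a sub-path is not allowed with Partitioned Prefix.
CREATE STORAGE VOLUME s3_prefixed_volume
TYPE = S3
LOCATIONS = ("s3://testbucket")
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "xxxxxxxxxx",
    "aws.s3.secret_key" = "yyyyyyyyyy",
    -- Both settings below are fixed once the volume is created.
    "aws.s3.enable_partitioned_prefix" = "true",
    "aws.s3.num_partitioned_prefix" = "512"
);
```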

Examples

Example 1: Create a storage volume my_s3_volume for the AWS S3 bucket defaultbucket, use IAM user-based credentials (access key and secret key) to access S3, and enable it.

CREATE STORAGE VOLUME my_s3_volume
TYPE = S3
LOCATIONS = ("s3://defaultbucket/test/")
PROPERTIES
(
    "enabled" = "true",
    "aws.s3.region" = "us-west-2",
    "aws.s3.endpoint" = "https://s3.us-west-2.amazonaws.com",
    "aws.s3.use_aws_sdk_default_behavior" = "false",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.access_key" = "xxxxxxxxxx",
    "aws.s3.secret_key" = "yyyyyyyyyy"
);

Example 2: Create a storage volume my_hdfs_volume for HDFS and enable it.

CREATE STORAGE VOLUME my_hdfs_volume
TYPE = HDFS
LOCATIONS = ("hdfs://127.0.0.1:9000/sr/test/")
PROPERTIES
(
    "enabled" = "true"
);

Example 3: Create a storage volume hdfsvolumehadoop for HDFS using simple authentication.

CREATE STORAGE VOLUME hdfsvolumehadoop
TYPE = HDFS
LOCATIONS = ("hdfs://127.0.0.1:9000/sr/test/")
PROPERTIES(
    "hadoop.security.authentication" = "simple",
    "username" = "starrocks"
);

Example 4: Create a storage volume hdfsvolkerberos that accesses HDFS using Kerberos ticket cache authentication.

CREATE STORAGE VOLUME hdfsvolkerberos
TYPE = HDFS
LOCATIONS = ("hdfs://127.0.0.1:9000/sr/test/")
PROPERTIES(
    "hadoop.security.authentication" = "kerberos",
    "hadoop.security.kerberos.ticket.cache.path" = "/path/to/ticket/cache/path"
);

Example 5: Create a storage volume hdfsvolha for an HDFS cluster with NameNode HA enabled.

CREATE STORAGE VOLUME hdfsvolha
TYPE = HDFS
LOCATIONS = ("hdfs://myhacluster/data/sr")
PROPERTIES(
    "dfs.nameservices" = "myhacluster",
    "dfs.ha.namenodes.myhacluster" = "nn1,nn2,nn3",
    "dfs.namenode.rpc-address.myhacluster.nn1" = "machine1.example.com:8020",
    "dfs.namenode.rpc-address.myhacluster.nn2" = "machine2.example.com:8020",
    "dfs.namenode.rpc-address.myhacluster.nn3" = "machine3.example.com:8020",
    "dfs.namenode.http-address.myhacluster.nn1" = "machine1.example.com:9870",
    "dfs.namenode.http-address.myhacluster.nn2" = "machine2.example.com:9870",
    "dfs.namenode.http-address.myhacluster.nn3" = "machine3.example.com:9870",
    "dfs.client.failover.proxy.provider.myhacluster" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);

Example 6: Create a storage volume webhdfsvol for WebHDFS.

CREATE STORAGE VOLUME webhdfsvol
TYPE = HDFS
LOCATIONS = ("webhdfs://namenode:9870/data/sr");

Example 7: Create a storage volume viewfsvol using Hadoop ViewFS.

CREATE STORAGE VOLUME viewfsvol
TYPE = HDFS
LOCATIONS = ("viewfs://clusterX/data/sr")
PROPERTIES(
    "fs.viewfs.mounttable.clusterX.link./data" = "hdfs://nn1-clusterx.example.com:8020/data",
    "fs.viewfs.mounttable.clusterX.link./project" = "hdfs://nn2-clusterx.example.com:8020/project"
);

Example 8: Create a storage volume adls2 for Azure Data Lake Storage Gen2 using a SAS token.

CREATE STORAGE VOLUME adls2
TYPE = ADLS2
LOCATIONS = ("adls2://testfilesystem/starrocks")
PROPERTIES (
    "azure.adls2.endpoint" = "https://test.dfs.core.windows.net",
    "azure.adls2.sas_token" = "xxx"
);

Example 9: Create a storage volume gs for Google Storage using an impersonated service account.

CREATE STORAGE VOLUME gs
TYPE = GS
LOCATIONS = ("gs://testbucket/starrocks")
PROPERTIES (
    "gcp.gcs.use_compute_engine_service_account" = "false",
    "gcp.gcs.service_account_email" = "user@hello.iam.gserviceaccount.com",
    "gcp.gcs.service_account_private_key_id" = "61d257bd847xxxxxxxxxxxxxxx4f0b9b6b9ca07af3b7ea",
    "gcp.gcs.service_account_private_key" = "-----BEGIN PRIVATE KEY----xxxx-----END PRIVATE KEY-----\n",
    "gcp.gcs.impersonation_service_account" = "admin@hello.iam.gserviceaccount.com"
);
