S3
描述
S3 表函数(table-valued-function,tvf),可以让用户像访问关系表格式数据一样,读取并访问 S3 兼容的对象存储上的文件内容。目前支持csv/csv_with_names/csv_with_names_and_types/json/parquet/orc
文件格式。
语法
S3(
"uri" = "<uri>",
"s3.access_key" = "<s3_access_key>",
"s3.secret_key" = "<s3_secret_key>",
"s3.region" = "<s3_region>",
"s3.endpoint" = "<s3_endpoint>",
"format" = "<format>"
[, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)
必选参数
参数 | 描述 |
---|---|
uri | 用于访问 S3 的 URI,该函数会根据 use_path_style 参数来决定使用路径样式(Path Style)还是虚拟托管样式(Virtual-hosted Style)进行访问 |
s3.access_key | 访问 S3 的访问密钥 |
s3.secret_key | 访问 S3 的秘密密钥 |
s3.region | S3 存储所在的区域 |
s3.endpoint | S3 存储的端点地址 |
format | 文件格式,支持 csv/csv_with_names/csv_with_names_and_types/json/parquet/orc |
可选参数
参数 | 描述 | 备注 |
---|---|---|
s3.session_token | S3 session token | |
use_path_style | 默认为false ,S3 SDK 默认使用 Virtual-hosted Syle 方式。但某些对象存储系统可能没开启或没支持 Virtual-hosted Style 方式的访问,此时我们可以添加 use_path_style 参数来强制使用 Path Style 方式。比如 minio 默认情况下只允许 path style 访问方式,所以在访问 MinIO 时要加上 use_path_style=true | |
force_parsing_by_standard_uri | 默认为 false ,用于强制将非标准的 URI 解析为标准 URI | |
column_separator | 列分割符,默认为\t | |
line_delimiter | 行分割符,默认为\n | |
compress_type | 目前支持 UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK。默认值为 UNKNOWN, 将会根据 uri 的后缀自动推断类型 | |
read_json_by_line | 默认为 "true" ,适用于 JSON 格式的导入 | JSON Load |
strip_outer_array | 默认为 "false" ,适用于 JSON 格式的导入 | JSON Load |
json_root | 默认为空,适用于 JSON 格式的导入 | JSON Load |
jsonpaths | 默认为空,适用于 JSON 格式的导入 | JSON Load |
num_as_string | 默认为 false ,适用于 JSON 格式的导入 | JSON Load |
fuzzy_parse | 默认为 false ,适用于 JSON 格式的导入 | |
trim_double_quotes | 默认为 false ,用于 CSV 格式的导入,裁剪掉每个字段最外层的双引号 | |
skip_lines | 默认为 0,表示跳过 CSV 文件的前几行。适用于 csv_with_names 或 csv_with_names_and_types 格式时失效 | |
path_partition_keys | 指定文件路径中携带的分区列名,例如 /path/to/city=beijing/date="2023-07-09" , 则填写 path_partition_keys="city,date" ,将会自动从路径中读取相应列名和列值进行导入 | |
resource | 指定 Resource 名,S3 TVF 可以利用已有的 S3 Resource 来直接访问 S3。创建 S3 Resource 的方法可以参照 CREATE-RESOURCE。该功能自 2.1.4 版本开始支持。 |
注意事项
- 对于 AWS S3,标准 uri styles 有以下几种:
- AWS Client Style(Hadoop S3 Style):
s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
。- Virtual Host Style:
https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
。- Path Style:
https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
。除了支持以上三个标准常见的 URI Styles, 还支持其他一些 URI Styles(也许不常见,但也有可能有):
- Virtual Host AWS Client (Hadoop S3) Mixed Style: >>
s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
- Path AWS Client (Hadoop S3) Mixed Style: > >
s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
详细使用案例可以参考下方的示例
- 直接查询 TVF 或基于该 TVF 创建 View,需要拥有该 Resource 的 USAGE 权限,查询基于 TVF 创建的 View,只需要该 View 的 SELECT 权限。
示例
-
读取并访问 S3 兼容的对象存储上的 CSV 格式文件
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv", "use_path_style" = "true") order by c1;
-
可以配合
desc function
使用Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv", "use_path_style" = "true");
-
不同 url schema 的写法
http:// 、https:// 使用示例:
-- 注意 URI Bucket 写法以及`use_path_style`参数设置,HTTP 同理。 -- 由于设置了 `"use_path_style"="true"`, 所以将采用 Path Style 的方式访问 S3。 select * from s3( "uri" = "https://endpoint/bucket/file/student.csv", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.endpoint" = "endpoint", "s3.region" = "region", "format" = "csv", "use_path_style"="true"); -- 注意 URI Bucket 写法以及 use_path_style 参数设置,http 同理。 -- 由于设置了 `"use_path_style"="false"`, 所以将采用 Virtual-hosted Style 方式访问 S3。 select * from s3( "uri" = "https://bucket.endpoint/bucket/file/student.csv", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.endpoint" = "endpoint", "s3.region" = "region", "format" = "csv", "use_path_style"="false"); -- 阿里云 OSS 和腾讯云 COS 采用 Virtual-hosted Style 方式访问 S3。 -- OSS select * from s3( "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.endpoint" = "oss-cn-beijing.aliyuncs.com", "s3.region" = "oss-cn-beijing", "format" = "parquet", "use_path_style" = "false"); -- COS select * from s3( "uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.endpoint" = "cos.ap-hongkong.myqcloud.com", "s3.region" = "ap-hongkong", "format" = "parquet", "use_path_style" = "false"); -- MinIO select * from s3( "uri" = "s3://bucket/file.csv", "s3.endpoint" = "http://172.21.0.101:9000", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.region" = "us-east-1", "format" = "csv" ); -- 百度云 BOS 采用兼容 S3 协议的 Virtual-hosted Style 方式访问 S3。 -- BOS select * from s3( "uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.region" = "bj", "s3.endpoint" = "http://bj.bcebos.com", "format" = "parquet", "use_path_style" = "false");
-
s3:// 使用示例:
-- 注意 URI Bucket 写法,无需设置 `use_path_style` 参数。 -- 将采用 Virtual-hosted Style 方式访问 S3。 select * from s3( "uri" = "s3://bucket/file/student.csv", "s3.endpoint"= "endpont", "s3.region"= "region", "s3.access_key"= "ak", "s3.secret_key" = "sk", "format" = "csv");
-
其它支持的 URI 风格示例:
-- Virtual Host AWS Client (Hadoop S3) Mixed Style。通过设置 `use_path_style = false` 以及 `force_parsing_by_standard_uri = true` 来使用。 select * from s3( "URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.endpoint"= "endpont", "s3.region"= "region", "format" = "csv", "use_path_style"="false", "force_parsing_by_standard_uri"="true"); -- Path AWS Client (Hadoop S3) Mixed Style。通过设置 `use_path_style = true` 以及 `force_parsing_by_standard_uri = true` 来使用。 select * from s3( "URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88", "s3.access_key"= "ak", "s3.secret_key" = "sk", "s3.endpoint"= "endpont", "s3.region"= "region", "format" = "csv", "use_path_style"="true", "force_parsing_by_standard_uri"="true");
-
CSV format
由于 S3 table-valued-function 事先并不知道 Table Schema,所以会先读一遍文件来解析出 Table Schema。
csv
格式:S3 table-valued-function 读取 S3 上的文件并当作 CSV 文件来处理,读取文件中的第一行用于解析 Table Schema。文件第一行的列个数n
将作为 Table Schema 的列个数,Table Schema 的列名则自动取名为c1, c2, ..., cn
,列类型都设置为String
, 举例:student1.csv 文件内容为:
1,ftw,12 2,zs,18 3,ww,20
-
使用 S3 TVF
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv", "use_path_style" = "true") order by c1;
+------+------+------+ | c1 | c2 | c3 | +------+------+------+ | 1 | ftw | 12 | | 2 | zs | 18 | | 3 | ww | 20 | +------+------+------+
-
可以配合
desc function S3()
来查看 Table SchemaDesc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv", "use_path_style" = "true");
+-------+------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+------+------+-------+---------+-------+ | c1 | TEXT | Yes | false | NULL | NONE | | c2 | TEXT | Yes | false | NULL | NONE | | c3 | TEXT | Yes | false | NULL | NONE | +-------+------+------+-------+---------+-------+
-
csv_with_names format
csv_with_names
格式:解析文件的第一行作为 Table Schema 的列个数和列名,列类型则都设置为String
, 举例:student_with_names.csv 文件内容为
id,name,age 1,ftw,12 2,zs,18 3,ww,20
-
使用 S3 tvf
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv_with_names", "use_path_style" = "true") order by id;
+------+------+------+ | id | name | age | +------+------+------+ | 1 | ftw | 12 | | 2 | zs | 18 | | 3 | ww | 20 | +------+------+------+
-
同样配合
desc function S3()
可查看 Table SchemaDesc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv_with_names", "use_path_style" = "true");
+-------+------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+------+------+-------+---------+-------+ | id | TEXT | Yes | false | NULL | NONE | | name | TEXT | Yes | false | NULL | NONE | | age | TEXT | Yes | false | NULL | NONE | +-------+------+------+-------+---------+-------+
-
csv_with_names_and_types format
csv_with_names_and_types
格式:目前暂不支持从 CSV 文件中解析出 Column Type。使用该 Format 时,S3 TVF 会解析文件的第一行作为 Table Schema 的列个数和列名,列类型则都设置为 String,同时将忽略该文件的第二行。student_with_names_and_types.csv
文件内容为id,name,age INT,STRING,INT 1,ftw,12 2,zs,18 3,ww,20
-
使用 S3 TVF
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv_with_names_and_types", "use_path_style" = "true") order by id;
+------+------+------+ | id | name | age | +------+------+------+ | 1 | ftw | 12 | | 2 | zs | 18 | | 3 | ww | 20 | +------+------+------+
-
同样配合
desc function S3()
可查看 Table SchemaDesc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv_with_names_and_types", "use_path_style" = "true");
+-------+------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------+------+------+-------+---------+-------+ | id | TEXT | Yes | false | NULL | NONE | | name | TEXT | Yes | false | NULL | NONE | | age | TEXT | Yes | false | NULL | NONE | +-------+------+------+-------+---------+-------+
-
JSON format
json
格式:JSON 格式涉及到较多的可选参数,各个参数的意义可以参考:Json Load。S3 TVF 查询 JSON 格式文件时根据json_root
和jsonpaths
参数定位到一个 JSON 对象,将该对象的中的key
作为 Table Schema 的列名,列类型都设置为 String。举例:data.json 文件
[{"id":1, "name":"ftw", "age":18}] [{"id":2, "name":"xxx", "age":17}] [{"id":3, "name":"yyy", "age":19}]
-
使用 S3 TVF 查询
select * from s3( "uri" = "http://127.0.0.1:9312/test2/data.json", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "json", "strip_outer_array" = "true", "read_json_by_line" = "true", "use_path_style"="true"); +------+------+------+ | id | name | age | +------+------+------+ | 1 | ftw | 18 | | 2 | xxx | 17 | | 3 | yyy | 19 | +------+------+------+ select * from s3( "uri" = "http://127.0.0.1:9312/test2/data.json", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "json", "strip_outer_array" = "true", "jsonpaths" = "[\"$.id\", \"$.age\"]", "use_path_style"="true"); +------+------+ | id | age | +------+------+ | 1 | 18 | | 2 | 17 | | 3 | 19 | +------+------+
-
Parquet format
parquet
格式:S3 TVF 支持从 Parquet 文件中解析出 Table Schema 的列名、列类型。举例:select * from s3( "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "parquet", "use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi | | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo | | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag | | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r | | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
MySQL [(none)]> desc function s3( "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "parquet", "use_path_style"="true");
+---------------+--------------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +---------------+--------------+------+-------+---------+-------+ | p_partkey | INT | Yes | false | NULL | NONE | | p_name | TEXT | Yes | false | NULL | NONE | | p_mfgr | TEXT | Yes | false | NULL | NONE | | p_brand | TEXT | Yes | false | NULL | NONE | | p_type | TEXT | Yes | false | NULL | NONE | | p_size | INT | Yes | false | NULL | NONE | | p_container | TEXT | Yes | false | NULL | NONE | | p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE | | p_comment | TEXT | Yes | false | NULL | NONE | +---------------+--------------+------+-------+---------+-------+
-
orc format
orc
格式:和parquet
format 使用方法一致,将format
参数设置为orc
。select * from s3( "uri" = "http://127.0.0.1:9312/test2/test.snappy.orc", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "orc", "use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+ | 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi | | 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo | | 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag | | 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r | | 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully | +-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
-
avro format
avro
格式:S3 TVF 支持从 avro 文件中解析出 Table Schema 的列名、列类型。举例:select * from s3( "uri" = "http://127.0.0.1:9312/test2/person.avro", "ACCESS_KEY" = "ak", "SECRET_KEY" = "sk", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+ | name | boolean_type | double_type | long_type | +--------+--------------+-------------+-----------------+ | Alyssa | 1 | 10.0012 | 100000000221133 | | Ben | 0 | 5555.999 | 4009990000 | | lisi | 0 | 5992225.999 | 9099933330 | +--------+--------------+-------------+-----------------+
-
URI 包含通配符
URI 可以使用通配符来读取多个文件。注意:如果使用通配符要保证各个文件的格式是一致的 (尤其是
csv
/csv_with_names
/csv_with_names_and_types
算做不同的格式),S3 TVF 用第一个文件来解析出 Table Schema。 如下两个 CSV 文件:// file1.csv 1,aaa,18 2,qqq,20 3,qwe,19 // file2.csv 5,cyx,19 6,ftw,21
-
可以在 URI 上使用通配符来导入。
select * from s3( "uri" = "http://127.0.0.1:9312/test2/file*.csv", "s3.access_key"= "minioadmin", "s3.secret_key" = "minioadmin", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "csv", "use_path_style"="true");
+------+------+------+ | c1 | c2 | c3 | +------+------+------+ | 1 | aaa | 18 | | 2 | qqq | 20 | | 3 | qwe | 19 | | 5 | cyx | 19 | | 6 | ftw | 21 | +------+------+------+
-
配合
insert into
和cast
使用S3
TVF-- 创建 Doris 内部表 CREATE TABLE IF NOT EXISTS ${testTable} ( id int, name varchar(50), age int ) COMMENT "my first table" DISTRIBUTED BY HASH(id) BUCKETS 32 PROPERTIES("replication_num" = "1"); -- 使用 S3 插入数据 insert into ${testTable} (id,name,age) select cast (id as INT) as id, name, cast (age as INT) as age from s3( "uri" = "${uri}", "s3.access_key"= "${ak}", "s3.secret_key" = "${sk}", "s3.endpoint" = "http://127.0.0.1:9312", "s3.region" = "us-east-1", "format" = "${format}", "strip_outer_array" = "true", "read_json_by_line" = "true", "use_path_style" = "true");