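Every table and every partition in Hive can be bucketed. A table or partition is stored as files on HDFS, and bucketing physically splits that data into several files, one per bucket. Bucketing is intended for large data sets.
create table student_bucket(id INT, name STRING, age INT) clustered by (age) into 4 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
If the rows inside each bucket also need to be sorted, use a CREATE TABLE statement like the following: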
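As a rough mental model, a row's bucket is chosen by hashing the bucketing column and taking the result modulo the number of buckets. The query below is only a conceptual sketch of that rule for the 4-bucket student_bucket table. It uses Hive's built-in hash() and pmod() functions, but the hash Hive applies internally when writing bucket files can differ between versions, so the numbers it prints are an assumption, not a guarantee of the actual file layout.

```sql
-- Conceptual illustration only: which of the 4 buckets each age value
-- would map to under a plain "hash modulo bucket count" rule.
-- Assumption: the internal bucketing hash may not match hash() exactly.
SELECT age,
       pmod(hash(age), 4) AS approx_bucket
FROM student_bucket;
```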
CREATE TABLE page_view(viewTime INT, userid BIGINT, page_url STRING, referrer_url STRING, ip STRING ) PARTITIONED BY(dt STRING, country STRING) CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' COLLECTION ITEMS TERMINATED BY '\002' MAP KEYS TERMINATED BY '\003' STORED AS SEQUENCEFILE;
Here the data in each bucket is sorted by viewTime.
Before inserting data, turn on bucketing enforcement:
set hive.enforce.bucketing=true;
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
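In other words, the bucketing and sorting clauses do not affect how data is inserted, only how it is read. The number of reduce tasks must equal the number of buckets, and the inserting query should use CLUSTER BY and SORT BY.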
insert into student_bucket select id,name,age from student cluster by (age);
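The two ways of loading a bucketed table described above can be sketched as follows. This assumes the student_bucket table defined earlier (4 buckets on age) and a plain source table named student. The hive.enforce.bucketing property applies to older Hive releases; as far as I know it was removed in Hive 2.0, where bucketing is always enforced, so treat option A as version-dependent.

```sql
-- Option A (older Hive releases): let Hive choose the reducer count
-- and distribute rows by the table's bucketing column automatically.
SET hive.enforce.bucketing = true;
INSERT INTO TABLE student_bucket
SELECT id, name, age FROM student;

-- Option B: do it by hand. The reducer count must equal the number
-- of buckets (4), and rows must be clustered by the bucketing column (age).
SET mapred.reduce.tasks = 4;
INSERT INTO TABLE student_bucket
SELECT id, name, age FROM student
CLUSTER BY age;
```

Use one option or the other, not both in the same load. With option B, clustering on a column other than the one named in CLUSTERED BY would produce files that do not match the table's declared bucketing.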
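Bucketing is useful for:
1. Data sampling and analysis
2. Faster joins, provided both tables are bucketed on the same column and have the same number of buckets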
select a.id, a.age,b.name from a join b on a.id = b.id
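If a and b are both bucketed tables with the same bucketing column, the join does not need a Cartesian product over the full tables, because rows with the same id are assigned to the same bucket in both tables, so only matching buckets have to be joined.
Reposted from: http://efkai.baihongyu.com/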
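Both uses can be sketched in HiveQL. The sampling query assumes the student_bucket table from earlier. The join assumes a and b are both bucketed on id with the same bucket count, which the original does not show. Whether the MAPJOIN hint is honored depends on the Hive version and configuration, so take this as a starting point rather than a tuned setup.

```sql
-- 1. Sampling: read only the first of the 4 buckets.
-- Because the table is bucketed on age, Hive can read a single
-- bucket file instead of scanning the whole table.
SELECT id, name, age
FROM student_bucket TABLESAMPLE(BUCKET 1 OUT OF 4 ON age);

-- 2. Bucket map join: each bucket of a is joined only against
-- the corresponding bucket of b.
SET hive.optimize.bucketmapjoin = true;
SELECT /*+ MAPJOIN(b) */ a.id, a.age, b.name
FROM a JOIN b ON a.id = b.id;
```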