最近BA用户反馈有两句看似很像的语句返回的结果数不一样,比较奇怪,怀疑是不是Hive的Bug
Query 1 返回结果数6071
select count(distinct reviewid) as dis_reviewcnt from (select a.reviewid from bi.dpods_dp_reviewreport a left outer join bi.dpods_dp_reviewlog b on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24' where to_date(a.feedadddate) >= '2013-07-01' and a.hp_statdate='2013-07-24' ) a
Query 2 返回结果数6443
select count(distinct reviewid) as dis_reviewcnt from (select a.reviewid from bi.dpods_dp_reviewreport a left outer join bi.dpods_dp_reviewlog b on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24' and a.hp_statdate='2013-07-24' where to_date(a.feedadddate) >= '2013-07-01' ) a
第二条query比第一条多了372条数据,而且在子查询的左表中并不存在
两条语句唯一的区别是dpods_dp_reviewreport的分区过滤条件(hp_statdate是partition column)一个在where后面,另一个在on后面
粗看感觉出来的数据应该是一样的,但是玄机其实就在where和on的区别。
where 后面跟的是过滤条件,query 1 中的a.hp_statdate='2013-07-24', 在table scan之前就会Partition Pruner 过滤分区,所以只有'2013-07-24'下的数据会和dpods_dp_reviewlog进行join。
而query 2中会读入所有partition下的数据,再和dpods_dp_reviewlog join,并且根据join的关联条件只有a.hp_statdate='2013-07-24'的时候才会真正执行join,其余情况下又由于是left outer join, join不上右面会留NULL,query 2中其实是取出了所有的reviewid,所以会和query 1 结果不一样
可以做一个实验,query2去掉on后面的a.hp_statdate='2013-07-24',其余不动,执行语句,出来的distinct reviewcnt 也是 6443
select count(distinct reviewid) as dis_reviewcnt from (select a.reviewid from bi.dpods_dp_reviewreport a left outer join bi.dpods_dp_reviewlog b on a.reviewid=b.reviewid and b.hp_statdate='2013-07-24' where to_date(a.feedadddate) >= '2013-07-01' ) a
更多精彩内容:http://www.bianceng.cnhttp://www.bianceng.cn/database/extra/
query 1的query plan
ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewreport) a) (TOK_TABREF (TOK_TABNAME bi dpods_dp_reviewlog) b) (and (= (. (TOK_TABLE_OR_COL a) reviewid) (. (TOK_TABLE_OR_COL b) reviewid)) (= (. (TOK_TABLE_OR_COL b) hp_statdate) '2013-07-24')))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) reviewid))) (TOK_WHERE (and (>= (TOK_FUNCTION to_date (. (TOK_TABLE_OR_COL a) feedadddate)) '2013-07-01') (= (. (TOK_TABLE_OR_COL a) hp_statdate) '2013-07-24'))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL reviewid)) dis_reviewcnt)))) STAGE DEPENDENCIES: Stage-5 is a root stage , consists of Stage-1 Stage-1 Stage-2 depends on stages: Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-5 Conditional Operator Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a:a TableScan alias: a Filter Operator predicate: expr: (to_date(feedadddate) >= '2013-07-01') type: boolean Reduce Output Operator key expressions: expr: reviewid type: int sort order: + Map-reduce partition columns: expr: reviewid type: int tag: 0 value expressions: expr: feedadddate type: string expr: reviewid type: int expr: hp_statdate type: string a:b TableScan alias: b Reduce Output Operator key expressions: expr: reviewid type: int sort order: + Map-reduce partition columns: expr: reviewid type: int tag: 1 Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col5} {VALUE._col8} {VALUE._col17} 1 handleSkewJoin: false outputColumnNames: _col5, _col8, _col17 Select Operator expressions: expr: _col8 type: int outputColumnNames: _col0 Select Operator expressions: expr: _col0 type: int outputColumnNames: _col0 Group By Operator aggregations: expr: count(DISTINCT _col0) bucketGroup: false keys: expr: _col0 type: int mode: hash outputColumnNames: _col0, _col1 File Output Operator compressed: true GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: hdfs://10.2.6.102/tmp/hive-hadoop/hive_2013-07-26_18-10-59_408_7272696604651905662/-mr-10002 Reduce Output Operator key expressions: expr: _col0 type: int sort order: + tag: -1 value expressions: expr: _col1 type: bigint Reduce Operator Tree: Group By Operator aggregations: expr: count(DISTINCT KEY._col0:0._col0) bucketGroup: false mode: mergepartial outputColumnNames: _col0 Select Operator expressions: expr: _col0 type: bigint outputColumnNames: _col0 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1
Query 2的query plan
以上是小编为您精心准备的的内容,在的博客、问答、公众号、人物、课程等栏目也有的相关内容,欢迎继续使用右上角搜索按钮进行搜索expr
, type
, operator
, 2013
, col
map tree
hive left outer join、hive full outer join、hive outer join、left outer join、left outer join 用法,以便于您获取更多的相关知识。