[Hadoop Big Data] - Hive JOIN Examples Explained

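SQL usually relies on JOIN to connect two tables for more complex relational queries. With a user table and an order table, for example, a join can tell you which products a given user bought, or which users bought a given product.

Hive supports the same operations, and because Hive runs on top of Hadoop there are many places where joins can be optimized: ordering joins from small tables to large tables, caching small tables, avoiding caching large tables, and so on.

Let's take a look at the join operations in Hive. They are in fact much the same as in SQL.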

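Data preparation: create the data --> create the tables --> load the data

First create two raw data files. Each file has three columns: the first is the id, the second is a name, and the third is an id from the other table. The second column makes it easy to see which table each row came from in the join results: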

[xingoo@localhost tmp]$ cat aa.txt
1 a 3
2 b 4
3 c 1
[xingoo@localhost tmp]$ cat bb.txt
1 xxx 2
2 yyy 3
3 zzz 5

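Next create the two tables. Note that the field delimiter of the tables is a space; the second table can be created directly from the definition of the first.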

hive> create table aa
    > (a string,b string,c string)
    > row format delimited
    > fields terminated by ' ';
OK
Time taken: 0.19 seconds
hive> create table bb like aa;
OK
Time taken: 0.188 seconds

Check the structure of both tables:

hive> describe aa;
OK
a                       string
b                       string
c                       string
Time taken: 0.068 seconds, Fetched: 3 row(s)
hive> describe bb;
OK
a                       string
b                       string
c                       string
Time taken: 0.045 seconds, Fetched: 3 row(s)

Now load the data from the local files:

hive> load data local inpath '/usr/tmp/aa.txt' overwrite into table aa;
Loading data to table test.aa
OK
Time taken: 0.519 seconds
hive> load data local inpath '/usr/tmp/bb.txt' overwrite into table bb;
Loading data to table test.bb
OK
Time taken: 0.321 seconds

Inner join

An inner join uses the ON clause and returns only the rows from table 1 and table 2 that satisfy the join condition.

hive> select * from aa a join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824161233_f9ecefa2-e5d7-416d-8d90-e191937e7313
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:12:44 Starting to launch local task to process map join;  maximum memory = 518979584
2016-08-24 16:12:47 Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2016-08-24 16:12:47 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-12-33_145_337836390845333215-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (332 bytes)
2016-08-24 16:12:47 End of local task; Time Taken: 3.425 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:12:50,222 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local944389202_0007
MapReduce Jobs Launched:
Stage-Stage-3:  HDFS Read: 1264 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3   c   1   1   xxx 2
1   a   3   3   zzz 5
Time taken: 17.083 seconds, Fetched: 2 row(s)
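The log above mentions dumping a side-table into a hashtable file before a map-only job runs. As a rough illustration of that idea (not Hive's actual implementation), the Python sketch below hashes one table on its join key and streams the other table past it. The helper name `map_side_inner_join` and the choice of `aa` as the hashed side are assumptions made for this example; the rows are copied from aa.txt and bb.txt.

```python
# Sample rows copied from aa.txt and bb.txt, as (a, b, c) tuples.
AA = [("1", "a", "3"), ("2", "b", "4"), ("3", "c", "1")]
BB = [("1", "xxx", "2"), ("2", "yyy", "3"), ("3", "zzz", "5")]


def map_side_inner_join(side_rows, stream_rows):
    """Hypothetical helper: inner join on side.c = stream.a via a hash table."""
    # Build phase: index the (small) side table by its join key, column c.
    table = {}
    for row in side_rows:
        table.setdefault(row[2], []).append(row)

    # Probe phase: stream the other table and look up its key, column a.
    joined = []
    for row in stream_rows:
        for match in table.get(row[0], []):
            joined.append(match + row)
    return joined


inner_rows = map_side_inner_join(AA, BB)
for row in inner_rows:
    print("\t".join(row))
```

Because only the small table has to fit in memory and no shuffle is needed, this kind of join can run without a reduce phase, which is consistent with the "Number of reduce tasks is set to 0" line in the log.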

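Left join

A left join returns every row of the left table. Where a matching row exists in the right table its columns are shown; otherwise they are filled with NULL.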

hive> select * from aa a left outer join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824161637_6d540592-13fd-4f59-a2cf-0a91c0fc9533
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:16:48 Starting to launch local task to process map join;  maximum memory = 518979584
2016-08-24 16:16:51 Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
2016-08-24 16:16:51 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-16-37_813_4572869866822819707-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (338 bytes)
2016-08-24 16:16:51 End of local task; Time Taken: 2.634 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:16:53,843 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local1670258961_0008
MapReduce Jobs Launched:
Stage-Stage-3:  HDFS Read: 1282 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1   a   3   3   zzz 5
2   b   4   NULL    NULL    NULL
3   c   1   1   xxx 2
Time taken: 16.048 seconds, Fetched: 3 row(s)
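The left join semantics can be modelled in a few lines of Python. This is only a sketch of the behaviour on the sample data, using a simple nested loop rather than anything Hive does internally; `left_outer_join` is a hypothetical helper and `None` stands in for NULL.

```python
# Sample rows copied from aa.txt and bb.txt, as (a, b, c) tuples.
AA = [("1", "a", "3"), ("2", "b", "4"), ("3", "c", "1")]
BB = [("1", "xxx", "2"), ("2", "yyy", "3"), ("3", "zzz", "5")]


def left_outer_join(left, right):
    """Hypothetical helper: join on left.c = right.a, keeping every left row."""
    joined = []
    for l in left:
        matches = [r for r in right if l[2] == r[0]]
        if matches:
            joined.extend(l + r for r in matches)
        else:
            # No partner on the right: pad the right-hand columns with None.
            joined.append(l + (None, None, None))
    return joined


left_rows = left_outer_join(AA, BB)
for row in left_rows:
    print("\t".join("NULL" if v is None else v for v in row))
```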

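Right join

A right join mirrors the left join: every row of the right table is returned, and the left table's columns are NULL where there is no match.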

hive> select * from aa a right outer join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162227_5d0f0090-1a9b-4a3f-9e82-e93c4d180f4b
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:22:37 Starting to launch local task to process map join;  maximum memory = 518979584
2016-08-24 16:22:40 Dump the side-table for tag: 0 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable
2016-08-24 16:22:40 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-22-27_619_7820027359528638029-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile20--.hashtable (332 bytes)
2016-08-24 16:22:40 End of local task; Time Taken: 2.368 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:22:43,060 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local2001415675_0009
MapReduce Jobs Launched:
Stage-Stage-3:  HDFS Read: 1306 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3   c   1   1   xxx 2
NULL    NULL    NULL    2   yyy 3
1   a   3   3   zzz 5
Time taken: 15.483 seconds, Fetched: 3 row(s)
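For comparison, here is the same kind of Python sketch for the right join. It drives the loop from the right table and pads the left-hand columns instead; `right_outer_join` is a hypothetical helper and `None` stands in for NULL.

```python
# Sample rows copied from aa.txt and bb.txt, as (a, b, c) tuples.
AA = [("1", "a", "3"), ("2", "b", "4"), ("3", "c", "1")]
BB = [("1", "xxx", "2"), ("2", "yyy", "3"), ("3", "zzz", "5")]


def right_outer_join(left, right):
    """Hypothetical helper: join on left.c = right.a, keeping every right row."""
    joined = []
    for r in right:
        matches = [l for l in left if l[2] == r[0]]
        if matches:
            joined.extend(l + r for l in matches)
        else:
            # No partner on the left: pad the left-hand columns with None.
            joined.append((None, None, None) + r)
    return joined


right_rows = right_outer_join(AA, BB)
for row in right_rows:
    print("\t".join("NULL" if v is None else v for v in row))
```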

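Full join

A full join returns the rows of both table 1 and table 2; wherever one side has no matching row, its columns are shown as NULL.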

hive> select * from aa a full outer join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162252_c71b2fae-9768-4b9a-b5ad-c06d7cdb60fb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2016-08-24 16:22:54,111 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_local1766586034_0010
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 4026 HDFS Write: 270 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3   c   1   1   xxx 2
NULL    NULL    NULL    2   yyy 3
1   a   3   3   zzz 5
2   b   4   NULL    NULL    NULL
Time taken: 1.689 seconds, Fetched: 4 row(s)
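A full join can be thought of as a left join plus the right-table rows that never found a partner. The Python sketch below models that on the sample data; `full_outer_join` is a hypothetical helper, `None` stands in for NULL, and the row order it produces is not meant to match Hive's, which depends on how the job is executed.

```python
# Sample rows copied from aa.txt and bb.txt, as (a, b, c) tuples.
AA = [("1", "a", "3"), ("2", "b", "4"), ("3", "c", "1")]
BB = [("1", "xxx", "2"), ("2", "yyy", "3"), ("3", "zzz", "5")]

PAD = (None, None, None)


def full_outer_join(left, right):
    """Hypothetical helper: join on left.c = right.a, keeping rows of both sides."""
    joined = []
    matched_right = set()  # indexes of right rows that found a partner
    for l in left:
        hit = False
        for i, r in enumerate(right):
            if l[2] == r[0]:
                joined.append(l + r)
                matched_right.add(i)
                hit = True
        if not hit:
            joined.append(l + PAD)
    # Finally emit the right rows that were never matched.
    for i, r in enumerate(right):
        if i not in matched_right:
            joined.append(PAD + r)
    return joined


full_rows = full_outer_join(AA, BB)
for row in full_rows:
    print("\t".join("NULL" if v is None else v for v in row))
```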

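Left semi join

This one is a little special: a SEMI JOIN returns only the columns of table 1 (the left table), and only for the rows that have a match in table 2. It is usually more efficient than a left join, because for each row of table 1 it looks in table 2 and returns as soon as the first match is found.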

hive> select * from aa a left semi join bb b on a.c=b.a;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162327_e7fc72a7-ef91-4d39-83bc-ff8159ea8816
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:23:37 Starting to launch local task to process map join;  maximum memory = 518979584
2016-08-24 16:23:41 Dump the side-table for tag: 1 with group count: 3 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
2016-08-24 16:23:41 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-23-27_008_3026796648107813784-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (317 bytes)
2016-08-24 16:23:41 End of local task; Time Taken: 3.586 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:23:43,798 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local521961878_0011
MapReduce Jobs Launched:
Stage-Stage-3:  HDFS Read: 1366 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1   a   3
3   c   1
Time taken: 16.811 seconds, Fetched: 2 row(s)
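In other words, a semi join behaves like an existence test rather than a pairing of rows. The Python sketch below models that with a set of right-hand keys; `left_semi_join` is a hypothetical helper, and the extra `("3", "dup", "9")` row is invented purely to show that duplicate matches on the right do not multiply the left rows the way an inner join would.

```python
# Sample rows copied from aa.txt and bb.txt, as (a, b, c) tuples.
AA = [("1", "a", "3"), ("2", "b", "4"), ("3", "c", "1")]
BB = [("1", "xxx", "2"), ("2", "yyy", "3"), ("3", "zzz", "5")]


def left_semi_join(left, right):
    """Hypothetical helper: keep left rows whose c appears as some right.a."""
    right_keys = {r[0] for r in right}  # only existence matters, not the rows
    return [l for l in left if l[2] in right_keys]


semi_rows = left_semi_join(AA, BB)

# Invented extra row: a second right-hand row with a = "3".
BB_DUP = BB + [("3", "dup", "9")]
semi_with_dup = left_semi_join(AA, BB_DUP)
inner_count_with_dup = sum(1 for l in AA for r in BB_DUP if l[2] == r[0])

for row in semi_rows:
    print("\t".join(row))
print(len(semi_with_dup), inner_count_with_dup)
```

With the duplicate row present, the semi join still yields the same two left rows, while the number of inner join pairs grows.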

Cartesian product

A Cartesian product joins every row of table 1 with every row of table 2.

hive> select * from aa join bb;
Warning: Map Join MAPJOIN[9][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160824162449_20e4b5ec-768f-48cf-a840-7d9ff360975f
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/apache-hive-2.1.0-bin/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hadoop/hadoop-2.6.4/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2016-08-24 16:25:00 Starting to launch local task to process map join;  maximum memory = 518979584
2016-08-24 16:25:02 Dump the side-table for tag: 0 with group count: 1 into file: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
2016-08-24 16:25:02 Uploaded 1 File to: file:/usr/hive/tmp/xingoo/a69078ea-b7d5-4a78-9342-05a1695e9f98/hive_2016-08-24_16-24-49_294_2706432574075169306-1/-local-10004/HashTable-Stage-3/MapJoin-mapfile40--.hashtable (305 bytes)
2016-08-24 16:25:02 End of local task; Time Taken: 2.892 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-24 16:25:05,677 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_local2068422373_0012
MapReduce Jobs Launched:
Stage-Stage-3:  HDFS Read: 1390 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1   a   3   1   xxx 2
2   b   4   1   xxx 2
3   c   1   1   xxx 2
1   a   3   2   yyy 3
2   b   4   2   yyy 3
3   c   1   2   yyy 3
1   a   3   3   zzz 5
2   b   4   3   zzz 5
3   c   1   3   zzz 5
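The result size is simply the product of the two row counts, here 3 x 3 = 9. A minimal Python sketch with `itertools.product` reproduces the nine combinations on the sample data; iterating over `BB` in the outer position is just a choice made to mirror the row order printed above.

```python
from itertools import product

# Sample rows copied from aa.txt and bb.txt, as (a, b, c) tuples.
AA = [("1", "a", "3"), ("2", "b", "4"), ("3", "c", "1")]
BB = [("1", "xxx", "2"), ("2", "yyy", "3"), ("3", "zzz", "5")]

# Pair every row of AA with every row of BB; no join condition is applied.
cross_rows = [a + b for b, a in product(BB, AA)]

for row in cross_rows:
    print("\t".join(row))
```

Since the output grows multiplicatively with the table sizes, a join without an ON clause is best kept to small tables.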

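Those are the join queries in Hive; they really do work the same way as in SQL.

This article is reposted from xingoo's blog on cnblogs (博客园). Original link: [Hadoop Big Data] - Hive JOIN Examples Explained. If you wish to repost it, please contact the original author.

Time: 2024-07-31 10:21:20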
