[Hadoop大数据]——Hive数据的导入导出

Hive作为大数据环境下的数据仓库工具，支持基于hadoop以sql的方式执行mapreduce的任务，非常适合对大量的数据进行全量的查询分析。

本文主要讲述下hive载cli中如何导入导出数据：

导入数据

第一种方式，直接从本地文件系统导入数据

我的本机有一个test1.txt文件，这个文件中有三列数据，并且每列都是以'\t'为分隔

[root@localhost conf]# cat /usr/tmp/test1.txt
1   a1  b1
2   a2  b2
3   a3  b3
4   a4  b

创建数据表：

>create table test1(a string,b string,c string)
>row format delimited
>fields terminated by '\t'
>stored as textfile;

导入数据：

load data local inpath '/usr/tmp/test1.txt' overwrite into table test1;

其中local inpath，表明路径为本机路径
overwrite表示加载的数据会覆盖原来的内容

第二种，从hdfs文件中导入数据

首先上传数据到hdfs中

hadoop fs -put /usr/tmp/test1.txt /test1.txt

在hive中查看test1.txt文件

hive> dfs -cat /test1.txt;
1   a1  b1
2   a2  b2
3   a3  b3
4   a4  b4

创建数据表，与前面一样。导入数据的命令有些差异:

load data inpath '/test1.txt' overwrite into table test2;

第三种，基于查询insert into导入

首先定义数据表，这里直接创建带有分区的表

hive> create table test3(a string,b string,c string) partitioned by (d string) row format delimited fields terminated by '\t' stored as textfile;
OK
Time taken: 0.109 seconds
hive> describe test3;
OK
a                       string
b                       string
c                       string
d                       string                                      

# Partition Information
# col_name              data_type               comment             

d                       string
Time taken: 0.071 seconds, Fetched: 9 row(s)

通过查询直接导入数据到固定的分区表中：

hive> insert into table test3 partition(d='aaaaaa') select * from test2;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823212718_9cfdbea4-42fa-4267-ac46-9ac2c357f944
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 21:27:21,621 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local1550375778_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/test3/d=aaaaaa/.hive-staging_hive_2016-08-23_21-27-18_739_4058721562930266873-1/-ext-10000
Loading data to table test.test3 partition (d=aaaaaa)
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 248 HDFS Write: 175 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 3.647 seconds

通过查询观察结果

hive> select * from test3;
OK
1   a1  b1  aaaaaa
2   a2  b2  aaaaaa
3   a3  b3  aaaaaa
4   a4  b4  aaaaaa
Time taken: 0.264 seconds, Fetched: 4 row(s)

PS:也可以直接通过动态分区插入数据:

insert into table test4 partition(c) select * from test2;

分区会以文件夹命名的方式存储：

hive> dfs -ls /user/hive/warehouse/test.db/test4/;
Found 4 items
drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b1
drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b2
drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b3
drwxr-xr-x   - root supergroup          0 2016-08-23 21:33 /user/hive/warehouse/test.db/test4/c=b4

第四种，直接基于查询创建数据表

直接通过查询创建数据表:

hive> create table test5 as select * from test4;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823213944_03672168-bc56-43d7-aefb-cac03a6558bf
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 21:39:46,030 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local855333165_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/.hive-staging_hive_2016-08-23_21-39-44_259_5484795730585321098-1/-ext-10002
Moving data to directory hdfs://localhost:8020/user/hive/warehouse/test.db/test5
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 600 HDFS Write: 466 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.184 seconds

查看结果

hive> select * from test5;
OK
1   a1  b1
2   a2  b2
3   a3  b3
4   a4  b4
Time taken: 0.147 seconds, Fetched: 4 row(s)

导出数据

导出到本地文件

执行导出本地文件命令:

hive> insert overwrite local directory '/usr/tmp/export' select * from test1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823221655_05b05863-6273-4bdd-aad2-e80d4982425d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 22:16:57,028 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local8632460_0005
Moving data to local directory /usr/tmp/export
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 794 HDFS Write: 498 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.569 seconds
hive>

在本地文件查看内容:

[root@localhost export]# ll
total 4
-rw-r--r--. 1 root root 32 Aug 23 22:16 000000_0
[root@localhost export]# cat 000000_0
1a1b1
2a2b2
3a3b3
4a4b4
[root@localhost export]# pwd
/usr/tmp/export
[root@localhost export]#

导出到hdfs

hive> insert overwrite directory '/usr/tmp/test' select * from test1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20160823214217_e8c71bb9-a147-4518-8353-81f9adc54183
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
2016-08-23 21:42:19,257 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_local628523792_0004
Stage-3 is selected by condition resolver.
Stage-2 is filtered out by condition resolver.
Stage-4 is filtered out by condition resolver.
Moving data to directory hdfs://localhost:8020/usr/tmp/test/.hive-staging_hive_2016-08-23_21-42-17_778_6818164305996247644-1/-ext-10000
Moving data to directory /usr/tmp/test
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 730 HDFS Write: 498 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.594 seconds

导出成功，查看导出的hdfs文件

hive> dfs -cat /usr/tmp/test;
cat: `/usr/tmp/test': Is a directory
Command failed with exit code = 1
Query returned non-zero code: 1, cause: null
hive> dfs -ls /usr/tmp/test;
Found 1 items
-rwxr-xr-x   3 root supergroup         32 2016-08-23 21:42 /usr/tmp/test/000000_0

hive> dfs -cat /usr/tmp/test/000000_0;
1a1b1
2a2b2
3a3b3
4a4b4
hive>

导出到另一个表

样例可以参考前面数据导入的部分:

insert into table test3 select * from test1;

本文转自博客园xingoo的博客，原文链接：[Hadoop大数据]——Hive数据的导入导出，如需转载请自行联系原博主。

时间： 2024-12-08 12:20:44

[Hadoop大数据]——Hive数据的导入导出的相关文章

临时数据文件 offline 对于导入导出的影响

临时数据文件 offline 对于导入导出的影响 sys@ORACL> alter database tempfile 'd:\oracle\oradata\oracl\temp01.dbf' offline; 数据库已更改. sys@ORACL> ================================================tempfile offline的情况. 1 导出少量数据时,没有报错,当导出大量数据时,会报EXP-00068: 表空间 TEMP 脱机 C:\Use

php mysql数据的导入导出,数据表结构的导入导出

实现数据的导入导出,数据表结构的导入导出 ********************************************************/ // //包含Mysql数据库操作文件 // require_once("MysqlDB.php"); /******************************************************* **类名:MysqlDB

excel 导入导出图片，图片失真！

问题描述我想实现一个数据通过excel形式导入导出的功能,数据中包含有图片,在使用javapoi实现后,发现将图片导入到excel后,再导出,会发生图片失真的情况.(图片文件大小和图片效果均有变化)求大侠们给予指教!注:我在excel的设置中已经关闭了图片自动压缩的功能.

[Hadoop大数据]——Hive连接JOIN用例详解

SQL里面通常都会用Join来连接两个表,做复杂的关联查询.比如用户表和订单表,能通过join得到某个用户购买的产品:或者某个产品被购买的人群.... Hive也支持这样的操作,而且由于Hive底层运行在hadoop上,因此有很多地方可以进行优化.比如小表到大表的连接操作.小表进行缓存.大表进行避免缓存等等... 下面就来看看hive里面的连接操作吧!其实跟SQL还是差不多的... 数据准备:创建数据-->创建表-->导入数据首先创建两个原始数据的文件,这两个文件分别有三列,第一列是id.第

操作几万条，甚至几十万条数据导入导出用什么方式比较好？除了数据库外，用NPOI好像不行，有没有大神操作过几十万条EXCEL数据的？

问题描述操作几万条,甚至几十万条数据导入导出用什么方式比较好?(这里指的导入导出是数据导入到程序里,进行一些修改操作,然后再导出目前操作5000条数据是没问题的,但是超过1W条就报错了,)报了一个这样的错误:其他信息:Exception:WrongLocalheadersignature:0x5757575A我觉得应该是长度受限制了,但是又没有什么好的方法解决除了数据库外,用NPOI好像不行,有没有大神操作过几十万条EXCEL数据的?是怎么解决的求指教解决方案解决方案二:没人给回复呀?解决

[Hadoop大数据]——Hive初识

Hive出现的背景 Hadoop提供了大数据的通用解决方案,比如存储提供了Hdfs,计算提供了MapReduce思想.但是想要写出MapReduce算法还是比较繁琐的,对于开发者来说,需要了解底层的hadoop api.如果不是开发者想要使用mapreduce就会很困难.... 另一方面,大部分的开发者都有使用SQL的经验.SQL成为开发者必备的技能... 那么可以不可以使用SQL来完成MapReduce的过程呢?-- 答案就是,Hive Hive能够解决的问题 Hive可以帮助开发者从现有的数

[Hadoop大数据]——Hive部署入门教程

Hive是为了解决hadoop中mapreduce编写困难,提供给熟悉sql的人使用的.只要你对SQL有一定的了解,就能通过Hive写出mapreduce的程序,而不需要去学习hadoop中的api. 在部署前需要确认安装jdk以及Hadoop 如果需要安装jdk以及hadoop可以参考我之前的博客: Linux下安装jdkLinux下安装hadoop伪分布式在安装之前,先了解下Hive都有哪些东西. 下载并解压缩去主页选择镜像地址: http://www.apache.org/dyn/cl

phpadmin如何导入导出大数据文件及php.ini参数修改_php技巧

最近遇到了数据库过大的时候用phpadmin导入的问题,新版本的phpadmin导入限定是8M,老版本的可能2M,我的数据库有几十兆这可怎么办呢? 首先如果你有独立服务器或vps的话可以找到 Apache 下的php.ini 这个文件来修改这个8M或2M的限制,怎么修改呢? 搜索到,修改这三个复制代码代码如下: upload_max_filesize = 2M post_max_size = 8M memory_limit = 128M 修改完毕重启下服务,进phpadmin看看吧,应该可

java-如何把数据的直接输入输出的方式变成文件的导入导出方式

问题描述如何把数据的直接输入输出的方式变成文件的导入导出方式求哪位大神解下下面的程序应该怎么写代码如何把数据的直接输入输出的方式变成文件的导入导出方式在下面题中1.输入10个学生5门课成绩,分别用函数求:1)每个学生平均分:2)每门课的平均分:3)找出最高的分数所对应的学生和课程:4)求平均分方差: ,xi为某一学生的平均分. 解决方案 Java中的应该是这样写的:先导入包:import java.util.Scanner;下一步创建扫描仪:Scanner input=new Scanne