Importing SQL Server 2012 tables into HIVE / HBASE with Sqoop on the Azure cloud platform

My name is Farooq and I am with the HDInsight support team here at Microsoft. In this blog I will give a brief overview of Sqoop in HDInsight and then use an example of importing data from a Windows Azure SQL Database table to an HDInsight cluster to demonstrate how you can get started with Sqoop in HDInsight.

What is Sqoop?

Sqoop is an Apache project and part of the Hadoop ecosystem. It allows data transfer between a Hadoop/HDInsight cluster and relational databases such as SQL Server, Oracle, MySQL, etc. Sqoop is a collection of related tools, for example import, export, list-all-tables, list-databases, etc. To use Sqoop, you specify the tool you want to use and the arguments that control the tool. For more information on Sqoop please check the Sqoop User Guide.
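
Every tool is invoked as "sqoop.cmd <tool-name> <arguments>". The snippet below is a minimal sketch (the server and credential placeholders are mine, not values used later in this walkthrough) showing two of these tools from the Hadoop Command Line:

    REM List all the tools available in this Sqoop installation
    sqoop.cmd help

    REM Use the list-databases tool against SQL Server via the JDBC driver
    sqoop.cmd list-databases --connect "jdbc:sqlserver://<ServerName>.database.windows.net:1433;username=<UserName>@<ServerName>;password=<Password>"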

When do you need to use Sqoop?

You need to use Sqoop only when you are trying to import/export data between Hadoop and a relational database. HDInsight provides a full-featured Hadoop Distributed File System (HDFS) over Windows Azure Blob storage (WASB), and if you want to upload data to HDInsight or WASB from any other source, for example from your local computer's file system, then you should use one of the tools discussed in this article. The same article also discusses how to import data to HDFS from SQL Database/SQL Server using Sqoop. In this blog I will elaborate on the same with an example and try to provide more detailed information along the way.

 

What do I need to do for Sqoop to work in my HDInsight cluster?

HDInsight 2.1 includes Sqoop 1.4.3. The Microsoft SQL Server SQOOP Connector for Hadoop is now part of Apache Sqoop 1.4, so you do not need to install the connector separately. All HDInsight clusters also have the Microsoft SQL Server JDBC driver installed; so all the components needed to transfer data between an HDInsight cluster and SQL Server are already installed on an HDInsight cluster and you do not have to install anything.
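
If you want to confirm that these bits are present on your cluster, one quick check (a sketch only; the Sqoop folder name below is from my cluster and the jar file names may differ in yours) is to look in the Sqoop lib folder from the Hadoop Command Line:

    dir C:\apps\dist\sqoop-1.4.3.1.3.1.0-06\lib | findstr /i "sqljdbc sqoop"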

How can I run a Sqoop job?

With the HDInsight preview version we could only run Sqoop commands from the Hadoop command line after opening a remote desktop (RDP) session to the HDInsight cluster head node. However, the release version of the HDInsight SDK includes PowerShell cmdlets to run Sqoop jobs remotely. So we can:

  1. Run Sqoop jobs locally from the HDInsight head node using the Hadoop Command Line
  2. Run Sqoop jobs remotely using the HDInsight SDK PowerShell cmdlets

 

We recommend that you run your Sqoop commands remotely using the HDInsight SDK cmdlets. We will discuss both options in detail. First, let's see how we can run Sqoop jobs locally from the HDInsight head node using the Hadoop Command Line.

 

Run Sqoop jobs locally from the HDInsight head node using the Hadoop Command Line

I am assuming you already have a Windows Azure SQL Database. If you don't and you want to get one, please follow the steps in this article. Let's follow the steps below to create a test table in your Windows Azure SQL Database and populate it with some sample data, which we will import into our HDInsight cluster shortly. I will show how to do this from the Windows Azure portal, but you can also connect to the Windows Azure SQL Database from SSMS and do the same.

Note: if you want to transfer data from a SQL Server in your own environment instead, you need to change the Sqoop command with the appropriate connection information; it should be very similar to the connection string I have provided later in this blog, under the 'More sample Sqoop commands' section, for SQL Server on a Windows Azure VM.

  1. Log in to your Windows Azure Portal, select 'SQL Databases' on the left and click 'Manage' at the bottom.

  2. Provide your Windows Azure SQL Database user ID and password to log in, and then click 'New Query' to open a new query window to run T-SQL queries.

  3. Copy and paste the following T-SQL query and execute it to create a test table, Table1.

    CREATE TABLE [dbo].[Table1](
        [ID] [int] NOT NULL,
        [FName] [nvarchar](50) NOT NULL,
        [LName] [nvarchar](50) NOT NULL,
    CONSTRAINT [PK_Table_4] PRIMARY KEY CLUSTERED
    (
        [ID] ASC
    )
    ) ON [PRIMARY]
    GO

  4. Run the following to populate Table1 with 4 rows.

    INSERT INTO [dbo].[Table1] VALUES (1,'John','Doe'), (2,'Harry','Hoe'), (3,'Carla','Coe'), (4,'Jackie','Joe');

    GO

  5. Now, finally, run the following T-SQL to make sure that the table is populated with the sample data. You should see the four rows you just inserted in the output.

    SELECT * from [dbo].[Table1]

 

 

Now let's follow the steps below to import the rows in Table1 to the HDInsight cluster.

  1. Log in to your HDInsight cluster head node via Remote Desktop (RDP) and double-click the 'Hadoop Command Line' icon on the desktop to open the Hadoop Command Line. RDP access is turned off by default, but you can follow the steps in this blog to enable RDP and then RDP to the head node of your HDInsight cluster.
  2. In the Hadoop Command Line, navigate to the "C:\apps\dist\sqoop-1.4.3.1.3.1.0-06\bin" folder.

    Note: Please verify the path for the Sqoop bin folder in your environment. It may slightly vary from version to version.
  3. Run the following Sqoop command to import all the rows of table "Table1" from the Windows Azure SQL Database "mfarooqSQLDB" to the HDInsight cluster.

    sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1

    Once the command has executed successfully, you should see output similar to the below in the Hadoop Command Line window.

  4. There are quite a few tools available to upload/download and view data in WASB. Let's use the Azure Storage Explorer tool. You need to install the tool on your workstation and configure it for your cluster. Once that is done, open the tool and locate the /user/hdp/SqoopImportTable1 folder. You should see something similar to the below: it shows 4 files, indicating that 4 map tasks were used. You can select a file and click the 'View' button to see the actual text data. You can also verify the output directly from the Hadoop Command Line, as shown in the sketch after this list.
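
Since the import writes plain text files to WASB, a quick alternative check is to list and print the output folder from the Hadoop Command Line. This is a minimal sketch assuming the default target directory used above; the part-m-* file names are the standard MapReduce output names:

    REM List the files the import produced (one per map task)
    hadoop fs -ls /user/hdp/SqoopImportTable1

    REM Print the contents of the first output file
    hadoop fs -cat /user/hdp/SqoopImportTable1/part-m-00000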

Now let's export the same rows back to the SQL database from the HDInsight cluster. Please use a different table with the same schema as 'Table1'; otherwise you would get a primary key violation error, since the rows already exist in 'Table1'.

  1. Create an empty table 'Table2' with the same schema as 'Table1'.

    CREATE TABLE [dbo].[Table2](
        [ID] [int] NOT NULL,
        [FName] [nvarchar](50) NOT NULL,
        [LName] [nvarchar](50) NOT NULL,
    CONSTRAINT [PK_Table_2] PRIMARY KEY CLUSTERED
    (
        [ID] ASC
    )
    ) ON [PRIMARY]
    GO

  2. Run the following Sqoop command from the Hadoop Command Line.

    sqoop.cmd export --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table2 --export-dir /user/hdp/SqoopImportTable1 --input-fields-terminated-by ","

More sample Sqoop commands:

Import from a SQL Server on a Windows Azure VM:

sqoop.cmd import --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUserName>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table_1 --target-dir /user/hdp/SqoopImportTable

Export to a SQL Server on a Windows Azure VM:

sqoop.cmd export --connect "jdbc:sqlserver://<WindowsAzureVMServerName>.cloudapp.net:1433;username=<SQLServerUserName>;password=<SQLServerPassword>;database=<SQLServerDatabaseName>" --table Table_2 --export-dir /user/hdp/SqoopImportTable2 --input-fields-terminated-by ","

Importing into Hive from a Windows Azure SQL Database:

sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --hive-import

Note: This will store the files under the hive/warehouse/<TableName> folder in HDFS (for example, hive/warehouse/table1/part-m-00000).
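
To confirm that the Hive import landed where expected, you can list that warehouse folder from the Hadoop Command Line. This is a sketch only, assuming the default warehouse location mentioned in the note above:

    hadoop fs -ls /hive/warehouse/table1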

Run Sqoop jobs remotely using the HDInsight SDK PowerShell cmdlets

To use the HDInsight PowerShell tools you need to install the Windows Azure PowerShell tools first and then install the HDInsight PowerShell tools. Then you need to prepare your workstation to use the HDInsight SDK. Please follow the detailed steps in this earlier blog post to install the tools and prepare your workstation to use the HDInsight SDK.

Once you have installed and configured the Windows Azure PowerShell tools and the HDInsight SDK, running a Sqoop job is very easy. Please follow the steps below to import all the rows of table "Table2" from the Windows Azure SQL Database "mfarooqSQLDB" to the HDInsight cluster.

  1. Open the Windows Azure PowerShell console on the workstation and run the following cmdlets one at a time.

    Note: You can also use Windows PowerShell ISE to type the code and run it all at once. PowerShell ISE makes editing easy, and you can open the tool from "C:\Windows\System32\WindowsPowerShell\v1.0\powershell_ise.exe".

  2. Set the variables for your Windows Azure Subscription name and the HDInsight cluster name.

    $subscriptionName = "<WindowsAzureSubscriptionName>"
    $clusterName = "<HDInsightClusterName>"
    Select-AzureSubscription $subscriptionName
    Use-AzureHDInsightCluster $clusterName -Subscription $subscriptionName

  3. Define the Sqoop job that we want to run. In this exercise we will import all the rows of table "Table2" that we created earlier in the Windows Azure SQL Database (it now contains the rows we exported to it from the cluster).

    $sqoop = New-AzureHDInsightSqoopJobDefinition -Command "import --connect jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName> --table Table2 --target-dir /user/hdp/SqoopImportTable8"

  4. Run the Sqoop job that we just defined.

    $sqoopJob = Start-AzureHDInsightJob -Subscription $subscriptionName -Cluster $clusterName -JobDefinition $sqoop

  5. Run the following to wait for the completion or failure of the HDInsight job and show its progress.

    Wait-AzureHDInsightJob -Subscription $subscriptionName -WaitTimeoutInSeconds 3600 -Job $sqoopJob

  6. Run the following to retrieve the log output for a job from the storage account associated with a specified cluster.

    Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subscriptionName -StandardError -JobId $sqoopJob.JobId

If the Sqoop job completes successfully, you should see something similar to the below in your Windows Azure PowerShell command-line window.

Troubleshooting tips

When you run a Sqoop command it runs a MapReduce job on the Hadoop cluster (map-only, with no reduce task). You can specify the number of map tasks, but by default four tasks are used. There is no separate log file specific to Sqoop, so we need to troubleshoot Sqoop job failures or performance issues the same way as any other MapReduce job failure or performance issue, and start by checking the task logs. I plan to write more on how to troubleshoot Sqoop issues, focusing on some specific scenarios, in the near future.
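
For example, the number of map tasks (and therefore the number of output files) can be controlled with Sqoop's standard -m (--num-mappers) argument. The command below is a sketch that reuses the placeholder connection string from earlier and forces a single map task:

    sqoop.cmd import --connect "jdbc:sqlserver://<SQLDatabaseServerName>.database.windows.net:1433;username=<SQLDatabaseUsername>@<SQLDatabaseServerName>;password=<SQLDatabasePassword>;database=<SQLDatabaseDatabaseName>" --table Table1 --target-dir /user/hdp/SqoopImportTable1 -m 1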

That's all for today, and I hope you found this blog useful. I look forward to your comments and suggestions.
