A brief description of these systems:
HBase: a distributed key/value database
ZooKeeper: a coordination service for distributed applications
Hive: a SQL parsing engine
Flume: a distributed log collection system
I. Environment overview:
s1: hadoop-master
    namenode, jobtracker, secondarynamenode, datanode, tasktracker
s2: hadoop-node-1
    datanode, tasktracker
s3: hadoop-node-2
    datanode, tasktracker
namenode: manages the namespace of the entire HDFS
secondarynamenode: periodically checkpoints the namenode's metadata (often loosely described as a backup, but it is not a hot standby)
jobtracker: manages MapReduce jobs across the cluster
datanode: stores HDFS data blocks on each node
tasktracker: executes MapReduce tasks on each node
II. Prerequisite system configuration:
1. Add hosts entries (on all machines)
hwl@hadoop-master:~$ cat /etc/hosts
192.168.242.128 hadoop-master
192.168.242.128 hadoop-secondary
192.168.242.129 hadoop-node-1
192.168.242.130 hadoop-node-2
2. Set the hostname
hwl@hadoop-master:~$ cat /etc/hostname
hadoop-master
hwl@hadoop-node-1:~$ cat /etc/hostname
hadoop-node-1
hwl@hadoop-node-2:~$ cat /etc/hostname
hadoop-node-2
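Editing /etc/hostname alone does not change the running hostname; it takes effect after a reboot, or immediately by setting it by hand, for example on the master:
sudo hostname hadoop-master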
3. Configure passwordless SSH between all machines; a minimal sketch follows.
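Assuming the hwl user exists on every host and OpenSSH is installed, the key exchange can be done roughly as follows (run on each machine, targeting the other hosts):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa    # generate a key pair with no passphrase
ssh-copy-id hwl@hadoop-master               # append the public key to the target's authorized_keys
ssh-copy-id hwl@hadoop-node-1
ssh-copy-id hwl@hadoop-node-2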
III. Hadoop environment configuration:
1. Choose the installation packages
To make deploying the Hadoop cluster easier and more standardized, we use Cloudera's integrated distribution (CDH).
Cloudera ships Hadoop and its related systems pre-integrated and optimized, which avoids many bugs caused by version mismatches between the individual components.
https://ccp.cloudera.com/display/DOC/Documentation//
2. Install the Java environment
Since the Hadoop project is written mainly in Java, a JVM is required.
Add an APT source for a matching Java version.
On all servers, install:
apt-get install python-software-properties
vim /etc/apt/sources.list.d/sun-java-community-team-sun-java6-maverick.list
deb http://ppa.launchpad.net/sun-java-community-team/sun-java6/ubuntu maverick main
deb-src http://ppa.launchpad.net/sun-java-community-team/sun-java6/ubuntu maverick main
Install sun-java6-jdk:
add-apt-repository ppa:sun-java-community-team/sun-java6
apt-get update
apt-get install sun-java6-jdk
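To confirm that the Sun JDK is active before continuing (a quick check, not part of the original steps):
java -version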
3. Add Cloudera's Hadoop package repository
vim /etc/apt/sources.list.d/cloudera.list
deb http://archive.cloudera.com/debian maverick-cdh3u3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3u3 contrib
apt-get install curl
curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
apt-get update
4. Install the Hadoop packages
On hadoop-master, install:
apt-get install hadoop-0.20-namenode
apt-get install hadoop-0.20-datanode
apt-get install hadoop-0.20-secondarynamenode
apt-get install hadoop-0.20-jobtracker
On hadoop-node-1 and hadoop-node-2, install:
apt-get install hadoop-0.20-datanode
apt-get install hadoop-0.20-tasktracker
5. Create the Hadoop configuration directory
cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster
6. Activate the new configuration
update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50 (the trailing 50 is the priority)
Check the current configuration:
update-alternatives --display hadoop-0.20-conf
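If conf.my_cluster is not selected automatically (for example, another alternative has a higher priority), it can be forced; this is an optional step beyond the original procedure:
update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster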
7. Edit the Hadoop configuration files
7.1 On all servers, set the Java home in hadoop-env.sh:
hwl@hadoop-master:~$ cat /etc/hadoop/conf/hadoop-env.sh
# Set Hadoop-specific environment variables here.
export JAVA_HOME="/usr/lib/jvm/java-6-sun"
7.2 On all servers, configure the masters and slaves files:
hwl@hadoop-master:~$ cat /etc/hadoop/conf/masters
hadoop-master
hwl@hadoop-master:~$ cat /etc/hadoop/conf/slaves
hadoop-node-1
hadoop-node-2
7.3 Create the HDFS directories
mkdir -p /data/storage
mkdir -p /data/hdfs
chmod 700 /data/hdfs
chown -R hdfs:hadoop /data/hdfs
chmod 777 /data/storage
chmod o+t /data/storage
7.4 On all servers, configure core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/storage</value>
    <description>A directory for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:8020</value>
  </property>
</configuration>
hadoop.tmp.dir specifies the base directory where files uploaded to Hadoop are stored, so make sure it is large enough.
fs.default.name specifies the NameNode's address and port.
7.5 On all servers, configure hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>${hadoop.tmp.dir}/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hdfs</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>fs.checkpoint.period</name>
    <value>300</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop-secondary:50090</value>
  </property>
</configuration>
dfs.data.dir specifies where each DataNode stores its data.
dfs.replication specifies how many replicas are kept of each block, providing redundancy; the value should not exceed the number of DataNodes, otherwise blocks cannot be fully replicated and errors are reported.
dfs.datanode.max.xcievers sets the upper limit on the number of files a DataNode can serve concurrently.
7.6 On all servers, configure mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://hadoop-master:8021</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/system</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>/user</value>
  </property>
</configuration>
mapred.job.tracker specifies the JobTracker's address and port.
mapred.system.dir specifies the directory in HDFS used for MapReduce system files.
8. Format the HDFS distributed file system
hwl@hadoop-master:~$ sudo -u hdfs hadoop namenode -format
[sudo] password for hwl:
14/05/11 19:18:31 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop-master/192.168.242.128
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2-cdh3u3
STARTUP_MSG: build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~maverick -r 318bc781117fa276ae81a3d111f5eeba0020634f; compiled by 'root' on Tue Mar 20 13:45:02 PDT 2012
************************************************************/
14/05/11 19:18:31 INFO util.GSet: VM type = 32-bit
14/05/11 19:18:31 INFO util.GSet: 2% max memory = 19.33375 MB
14/05/11 19:18:31 INFO util.GSet: capacity = 2^22 = 4194304 entries
14/05/11 19:18:31 INFO util.GSet: recommended=4194304, actual=4194304
14/05/11 19:18:32 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
14/05/11 19:18:32 INFO namenode.FSNamesystem: fsOwner=hdfs (auth:SIMPLE)
14/05/11 19:18:32 INFO namenode.FSNamesystem: supergroup=supergroup
14/05/11 19:18:32 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/05/11 19:18:32 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
14/05/11 19:18:32 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/05/11 19:18:32 INFO common.Storage: Image file of size 110 saved in 0 seconds.
14/05/11 19:18:32 INFO common.Storage: Storage directory /data/storage/dfs/name has been successfully formatted.
14/05/11 19:18:32 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/192.168.242.128
************************************************************/
9. Start the services
9.1 On the master:
hwl@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-datanode start
Starting Hadoop datanode daemon: datanode running as process 1218. Stop it first.
hadoop-0.20-datanode.
hwl@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-namenode start
Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-0.20-namenode.
hwl@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-jobtracker start (it took two attempts to start; the first time the log showed SHUTDOWN)
Starting Hadoop jobtracker daemon: starting jobtracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-jobtracker-hadoop-master.out
hadoop-0.20-jobtracker.
hwl@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-secondarynamenode start
Starting Hadoop secondarynamenode daemon: secondarynamenode running as process 1586. Stop it first.
hadoop-0.20-secondarynamenode.
hwl@hadoop-master:~$ sudo netstat -tnpl
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 838/sshd
tcp6 0 0 :::38197 :::* LISTEN 1589/java
tcp6 0 0 :::50070 :::* LISTEN 2070/java
tcp6 0 0 :::22 :::* LISTEN 838/sshd
tcp6 0 0 :::50010 :::* LISTEN 1274/java
tcp6 0 0 :::50075 :::* LISTEN 1274/java
tcp6 0 0 :::50020 :::* LISTEN 1274/java
tcp6 0 0 :::50090 :::* LISTEN 1589/java
tcp6 0 0 :::45579 :::* LISTEN 2070/java
tcp6 0 0 :::36590 :::* LISTEN 1274/java
tcp6 0 0 192.168.242.128:8020 :::* LISTEN 2070/java
hwl@hadoop-master:~$ sudo jps
2070 NameNode
3117 Jps
1589 SecondaryNameNode
1274 DataNode
3061 JobTracker
9.2 On the nodes:
hwl@hadoop-node-1:~$ sudo /etc/init.d/hadoop-0.20-datanode start
Starting Hadoop datanode daemon: datanode running as process 1400. Stop it first.
hadoop-0.20-datanode.
hwl@hadoop-node-1:~$ sudo /etc/init.d/hadoop-0.20-tasktracker start
Starting Hadoop tasktracker daemon: starting tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-hadoop-node-1.out
hadoop-0.20-tasktracker.
hwl@hadoop-node-1:~$ sudo jps
1926 TaskTracker
1968 Jps
1428 DataNode
hwl@hadoop-node-2:~$ sudo /etc/init.d/hadoop-0.20-datanode start
Starting Hadoop datanode daemon: datanode running as process 1156. Stop it first.
hadoop-0.20-datanode.
hwl@hadoop-node-2:~$ sudo /etc/init.d/hadoop-0.20-tasktracker start
Starting Hadoop tasktracker daemon: starting tasktracker, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-tasktracker-hadoop-node-2.out
hadoop-0.20-tasktracker.
hwl@hadoop-node-2:~$ sudo jps
1864 TaskTracker
1189 DataNode
1905 Jps
10. Create the mapred.system.dir directory in HDFS
hwl@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /mapred/system
14/05/11 19:30:54 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
hwl@hadoop-master:~$ sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
14/05/11 19:31:11 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
11. Test basic HDFS operations
hwl@hadoop-master:~$ echo "Hello" > hello.txt
hwl@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /hwl
14/05/11 19:31:52 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
hwl@hadoop-master:~$ sudo -u hdfs hadoop fs -copyFromLocal hello.txt /hwl
14/05/11 19:32:03 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
hwl@hadoop-master:~$ sudo -u hdfs hadoop fs -ls /hwl
14/05/11 19:32:17 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Found 1 items
-rw-r--r-- 2 hdfs supergroup 14 2014-05-11 19:32 /hwl/hello.txt
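To read the file back and exercise the MapReduce layer end to end, something like the following can be run; the examples jar path is an assumption and may differ between CDH releases:
sudo -u hdfs hadoop fs -cat /hwl/hello.txt
sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 1000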
12. Check the cluster status:
12.1 Via the web UI
http://192.168.242.128:50070/
http://192.168.242.128:50030/
12.2 Via the command line
hwl@hadoop-master:~$ sudo -u hdfs hadoop dfsadmin -report
14/05/11 19:45:11 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Configured Capacity: 252069396480 (234.76 GB)
Present Capacity: 234272096256 (218.18 GB)
DFS Remaining: 234271989760 (218.18 GB)
DFS Used: 106496 (104 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)
Name: 192.168.242.128:50010
Decommission Status : Normal
Configured Capacity: 84023132160 (78.25 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 5935935488 (5.53 GB)
DFS Remaining: 78087155712(72.72 GB)
DFS Used%: 0%
DFS Remaining%: 92.94%
Last contact: Sun May 11 19:45:11 PDT 2014
Name: 192.168.242.129:50010
Decommission Status : Normal
Configured Capacity: 84023132160 (78.25 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 5931614208 (5.52 GB)
DFS Remaining: 78091489280(72.73 GB)
DFS Used%: 0%
DFS Remaining%: 92.94%
Last contact: Sun May 11 19:45:08 PDT 2014
Name: 192.168.242.130:50010
Decommission Status : Normal
Configured Capacity: 84023132160 (78.25 GB)
DFS Used: 36864 (36 KB)
Non DFS Used: 5929750528 (5.52 GB)
DFS Remaining: 78093344768(72.73 GB)
DFS Used%: 0%
DFS Remaining%: 92.94%
Last contact: Sun May 11 19:45:08 PDT 2014
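Besides dfsadmin -report, the overall health of HDFS (missing or under-replicated blocks) can also be checked with fsck, which is part of the standard hadoop command:
sudo -u hdfs hadoop fsck /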