Tags (space-separated): BigData HDFS
[toc]
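All of the analysis uses a single-node installation of Hadoop 2.6.4 as the example. The steps follow those in the installation document; see "Hadoop single-node installation".
A few prerequisites:
- Assume Hadoop is installed in HADOOP_HOME.
- The key script file is hadoop-functions.sh.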
- The script and source excerpts quoted below follow the newer 3.x-style code layout (hadoop-functions.sh, --workers), while the logs and default ports come from the 2.6.4 run.
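Step-by-step walkthrough
Formatting the file system
The first step is: $ bin/hdfs namenode -format
This runs the HADOOP_HOME/bin/hdfs script. Its namenode branch sets three important variables: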
namenode)
HADOOP_SUBCMD_SUPPORTDAEMONIZATION="true"
HADOOP_CLASSNAME='org.apache.hadoop.hdfs.server.namenode.NameNode'
hadoop_add_param HADOOP_OPTS hdfs.audit.logger "-Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER}"
;;
The script finally executes:
hadoop_java_exec "${HADOOP_SUBCMD}" "${HADOOP_CLASSNAME}" "${HADOOP_SUBCMD_ARGS[@]}"
hadoop_java_exec is a function declared in hadoop-functions.sh. It starts a Java process that runs the given command.
function hadoop_java_exec
{
# run a java command. this is used for
# non-daemons
local command=$1
local class=$2
shift 2
hadoop_debug "Final CLASSPATH: ${CLASSPATH}"
hadoop_debug "Final HADOOP_OPTS: ${HADOOP_OPTS}"
hadoop_debug "Final JAVA_HOME: ${JAVA_HOME}"
hadoop_debug "java: ${JAVA}"
hadoop_debug "Class name: ${class}"
hadoop_debug "Command line options: $*"
export CLASSPATH
#shellcheck disable=SC2086
exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
}
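To make the final exec line concrete, here is a minimal Java sketch of how the argument list is assembled. The class JavaExecSketch and its method are hypothetical, the option value is only an example, and splitting HADOOP_OPTS on whitespace is a simplification of the shell's unquoted word splitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: mirrors the argument order used by hadoop_java_exec.
final class JavaExecSketch {
    static List<String> buildCommand(String java, String subcommand, String hadoopOpts,
                                     String className, List<String> args) {
        List<String> cmd = new ArrayList<>();
        cmd.add(java);
        // The process is tagged with -Dproc_<subcommand>, e.g. -Dproc_namenode.
        cmd.add("-Dproc_" + subcommand);
        // HADOOP_OPTS is left unquoted in the script, so it expands to several words.
        String trimmed = hadoopOpts.trim();
        if (!trimmed.isEmpty()) {
            cmd.addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        cmd.add(className);
        cmd.addAll(args);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(
            "/usr/bin/java",                              // example JAVA path (assumption)
            "namenode",
            "-Dhdfs.audit.logger=INFO,NullAppender",      // example option value (assumption)
            "org.apache.hadoop.hdfs.server.namenode.NameNode",
            Arrays.asList("-format"));
        System.out.println(String.join(" ", cmd));
    }
}
```

The printed line has the same shape as the command the script hands to exec: the JVM, the proc tag, the options, the class name, and then the subcommand arguments.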
So the whole command chain boils down to running the main method of the org.apache.hadoop.hdfs.server.namenode.NameNode class with -format as its argument.
public static void main(String argv[]) throws Exception {
if (DFSUtil.parseHelpArgument(argv, NameNode.USAGE, System.out, true)) {
System.exit(0);
}
try {
StringUtils.startupShutdownMessage(NameNode.class, argv, LOG);
NameNode namenode = createNameNode(argv, null);
if (namenode != null) {
namenode.join();
}
} catch (Throwable e) {
LOG.error("Failed to start namenode.", e);
terminate(1, e);
}
}
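startupShutdownMessage prints startup information to the console. On a UNIX system it also registers the logger as a signal handler, so that an error-level log entry is written when one of { "TERM", "HUP", "INT" } is received. When a system signal ends the process, the log then shows why it exited.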
if (SystemUtils.IS_OS_UNIX) {
try {
SignalLogger.INSTANCE.register(LOG);
} catch (Throwable t) {
LOG.warn("failed to register any UNIX signal loggers: ", t);
}
Next comes createNameNode. It first parses the -format argument into StartupOption.FORMAT and then calls the format method. Because no cluster ID was specified, a new clusterId is generated, for example CID-d2425dab-c066-4a67-954f-32228c22abe6.
private static boolean format(Configuration conf, boolean force,
boolean isInteractive) throws IOException {
String nsId = DFSUtil.getNamenodeNameServiceId(conf);
String namenodeId = HAUtil.getNameNodeId(conf, nsId);
initializeGenericKeys(conf, nsId, namenodeId);
checkAllowFormat(conf);
if (UserGroupInformation.isSecurityEnabled()) {
InetSocketAddress socAddr = DFSUtilClient.getNNAddress(conf);
SecurityUtil.login(conf, DFS_NAMENODE_KEYTAB_FILE_KEY,
DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, socAddr.getHostName());
}
Collection<URI> nameDirsToFormat = FSNamesystem.getNamespaceDirs(conf);
List<URI> sharedDirs = FSNamesystem.getSharedEditsDirs(conf);
List<URI> dirsToPrompt = new ArrayList<URI>();
dirsToPrompt.addAll(nameDirsToFormat);
dirsToPrompt.addAll(sharedDirs);
List<URI> editDirsToFormat =
FSNamesystem.getNamespaceEditsDirs(conf);
// if clusterID is not provided - see if you can find the current one
String clusterId = StartupOption.FORMAT.getClusterId();
if(clusterId == null || clusterId.equals("")) {
//Generate a new cluster id
clusterId = NNStorage.newClusterID();
}
System.out.println("Formatting using clusterid: " + clusterId);
FSImage fsImage = new FSImage(conf, nameDirsToFormat, editDirsToFormat);
try {
FSNamesystem fsn = new FSNamesystem(conf, fsImage);
fsImage.getEditLog().initJournalsForWrite();
if (!fsImage.confirmFormat(force, isInteractive)) {
return true; // aborted
}
fsImage.format(fsn, clusterId);
} catch (IOException ioe) {
LOG.warn("Encountered exception during format: ", ioe);
fsImage.close();
throw ioe;
}
return false;
}
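The cluster ID fallback in format can be sketched in isolation. This is only an illustration: the class ClusterIdSketch is hypothetical, and the "CID-" prefix plus random UUID is an assumption based on the shape of the example ID shown earlier.

```java
import java.util.UUID;

// Hypothetical sketch of the "use the given cluster ID, else generate one" logic.
final class ClusterIdSketch {
    static String resolveClusterId(String provided) {
        if (provided == null || provided.isEmpty()) {
            // Assumed format: "CID-" followed by a random UUID.
            return "CID-" + UUID.randomUUID();
        }
        return provided;
    }

    public static void main(String[] args) {
        System.out.println("Formatting using clusterid: " + resolveClusterId(null));
        System.out.println("Formatting using clusterid: " + resolveClusterId("CID-my-cluster"));
    }
}
```

Passing an explicit ID keeps it unchanged, which is what lets several NameNodes be formatted into the same cluster.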
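The format method then constructs an FSImage, which sets the default checkpoint directories, sets up storage, and initializes the edit log. NNStorage manages the storage directories and FSEditLog is the edit log object.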
protected FSImage(Configuration conf,
Collection<URI> imageDirs,
List<URI> editsDirs)
throws IOException {
this.conf = conf;
storage = new NNStorage(conf, imageDirs, editsDirs);
if(conf.getBoolean(DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_KEY,
DFSConfigKeys.DFS_NAMENODE_NAME_DIR_RESTORE_DEFAULT)) {
storage.setRestoreFailedStorage(true);
}
this.editLog = FSEditLog.newInstance(conf, storage, editsDirs);
archivalManager = new NNStorageRetentionManager(conf, storage, editLog);
}
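With the file system image in place, the FSNamesystem can be constructed. It is the container for namespace state and carries all of the NameNode's bookkeeping work. The constructor is long, so it is not reproduced here; its steps are:
1. Create the KeyProvider. This example has no security setup and no key provider configured, so the result is "no KeyProvider found".
2. Read dfs.namenode.fslock.fair and construct the FSNamesystemLock. The default is true, meaning a fair read-write lock.
3. Set up the user and permissions.
4. Check whether HA is enabled.
5. Initialize the BlockManager and the managers it delegates to, including DatanodeManager (manages DataNode decommissioning [DecommissionManager] and other DataNode activity), HeartbeatManager (manages heartbeats received from DataNodes), and BlockIdManager (allocates and manages generation stamps and block IDs).
6. Construct the FSDirectory, a purely in-memory structure that works together with FSNamesystem to manage the NameNode's namespace and builds the INodes.
7. Initialize the CacheManager, which manages caching on the DataNodes.
8. Initialize the RetryCache, which caches non-idempotent requests that the RPC server has already processed successfully, so that retries can be handled.
At this point FSNamesystem is initialized. Finally FSImage's format method performs the formatting, and the NameNode shuts down.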
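The idea behind the RetryCache in step 8 can be shown with a small sketch. It is not the Hadoop class: RetryCacheSketch is hypothetical, keying entries by client ID and call ID is an assumption, and expiry of old entries is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: remember the result of a non-idempotent call so a retry does not run it twice.
final class RetryCacheSketch {
    private final Map<String, Object> completed = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T execute(String clientId, int callId, Supplier<T> operation) {
        String key = clientId + "#" + callId;
        // The operation only runs when no earlier successful result is cached.
        // If it throws, nothing is stored, so a later retry executes it again.
        return (T) completed.computeIfAbsent(key, k -> operation.get());
    }

    public static void main(String[] args) {
        RetryCacheSketch cache = new RetryCacheSketch();
        int[] created = {0};
        String first = cache.execute("client-1", 42, () -> "file#" + (++created[0]));
        String retry = cache.execute("client-1", 42, () -> "file#" + (++created[0]));
        System.out.println(first + " " + retry + " executions=" + created[0]);
    }
}
```

The retry with the same client and call ID gets the cached answer back instead of creating a second file, which is the behaviour a non-idempotent operation needs.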
Starting the NameNode and DataNode processes
The second step starts the NameNode and the DataNode with this script:
$ sbin/start-dfs.sh
Starting the NameNode
Core of the script:
#---------------------------------------------------------
# namenodes
NAMENODES=$("${HADOOP_HDFS_HOME}/bin/hdfs" getconf -namenodes 2>/dev/null)
if [[ -z "${NAMENODES}" ]]; then
NAMENODES=$(hostname)
fi
echo "Starting namenodes on [${NAMENODES}]"
hadoop_uservar_su hdfs namenode "${HADOOP_HDFS_HOME}/bin/hdfs" \
--workers \
--config "${HADOOP_CONF_DIR}" \
--hostnames "${NAMENODES}" \
--daemon start \
namenode ${nameStartOpt}
HADOOP_JUMBO_RETCOUNTER=$?
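In other words, hdfs getconf -namenodes first queries the configuration to list all NameNodes, and hdfs namenode then starts them. As shown above, the hdfs script simply launches the Java process for the given subcommand, and the namenode subcommand again maps to the main method of the NameNode class. The other steps are the same; only createNameNode behaves differently because the arguments differ. The start script passes no further arguments to namenode, so the default branch runs: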
default: {
DefaultMetricsSystem.initialize("NameNode");
return new NameNode(conf);
}
The core is the NameNode constructor. It first calls setClientNamenodeAddress to set the NameNode address, which by default is the value configured for fs.defaultFS, here hdfs://localhost:9000.
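As an illustration of how a host and port come out of an fs.defaultFS value, the following sketch uses java.net.URI. The class DefaultFsSketch is hypothetical, and the fallback port 8020 is an assumption about the default NameNode RPC port when the URI carries none.

```java
import java.net.URI;

// Hypothetical sketch: derive "host:port" from an fs.defaultFS style URI.
final class DefaultFsSketch {
    // Assumed fallback when the URI has no explicit port.
    static final int ASSUMED_DEFAULT_PORT = 8020;

    static String hostPort(String defaultFs) {
        URI uri = URI.create(defaultFs);
        // URI.getPort() returns -1 when no port is present.
        int port = uri.getPort() == -1 ? ASSUMED_DEFAULT_PORT : uri.getPort();
        return uri.getHost() + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(hostPort("hdfs://localhost:9000"));
    }
}
```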
It then initializes the NameNode:
protected void initialize(Configuration conf) throws IOException {
if (conf.get(HADOOP_USER_GROUP_METRICS_PERCENTILES_INTERVALS) == null) {
String intervals = conf.get(DFS_METRICS_PERCENTILES_INTERVALS_KEY);
if (intervals != null) {
conf.set(HADOOP_USER_GROUP_METRICS_PERCENTILES_INTERVALS,
intervals);
}
}
UserGroupInformation.setConfiguration(conf);
loginAsNameNodeUser(conf);
NameNode.initMetrics(conf, this.getRole());
StartupProgressMetrics.register(startupProgress);
pauseMonitor = new JvmPauseMonitor();
pauseMonitor.init(conf);
pauseMonitor.start();
metrics.getJvmMetrics().setPauseMonitor(pauseMonitor);
if (NamenodeRole.NAMENODE == role) {
startHttpServer(conf);
}
loadNamesystem(conf);
rpcServer = createRpcServer(conf);
initReconfigurableBackoffKey();
if (clientNamenodeAddress == null) {
// This is expected for MiniDFSCluster. Set it now using
// the RPC server's bind address.
clientNamenodeAddress =
NetUtils.getHostPortString(getNameNodeAddress());
LOG.info("Clients are to use " + clientNamenodeAddress + " to access"
+ " this namenode/service.");
}
if (NamenodeRole.NAMENODE == role) {
httpServer.setNameNodeAddress(getNameNodeAddress());
httpServer.setFSImage(getFSImage());
}
startCommonServices(conf);
startMetricsLogger(conf);
}
Several steps matter here. startHttpServer starts an HTTP server whose default address is http://0.0.0.0:50070. HDFS's HTTP server is a Jetty server; once it is up, its page shows the monitoring status of the whole HDFS. The Namesystem is loaded next. Its parameters are checked first, and a local single-node start produces these two warnings:
2017-02-11 21:59:28,765 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one image storage directory (dfs.namenode.name.dir) configured. Beware of data loss due to lack of redundant storage directories!
2017-02-11 21:59:28,765 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Only one namespace edits storage directory (dfs.namenode.edits.dir) configured. Beware of data loss due to lack of redundant storage directories!
Ignoring the single-directory warnings for image storage and edit log storage, the next step, as in the format path, is to construct the FSNamesystem. Then comes loadFSImage. After the FSImage is loaded, whether it needs to be saved is decided by:
final boolean needToSave = staleImage && !haEnabled && !isRollingUpgrade();
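In single-node mode staleImage, haEnabled and isRollingUpgrade() are all false. Because staleImage is false, needToSave is false as well, and FSImage's saveNamespace method is not called.
Afterwards the log contains a line such as:
2017-02-11 21:59:29,472 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 349 msecs
which indicates that the FSImage has been loaded.
The RPC server is initialized next. The concrete class is RPC.Server, a Protobuf-based RPC server that serves client requests.
In the last two lines of the method, startCommonServices starts all the *Manager components together with the httpServer and rpcServer, and also starts each ServicePlugin if any are configured. startMetricsLogger then turns on metrics logging.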
Starting the DataNode
Start script:
#---------------------------------------------------------
# datanodes (using default workers file)
echo "Starting datanodes"
hadoop_uservar_su hdfs datanode "${HADOOP_HDFS_HOME}/bin/hdfs" \
--workers \
--config "${HADOOP_CONF_DIR}" \
--daemon start \
datanode ${dataStartOpt}
(( HADOOP_JUMBO_RETCOUNTER=HADOOP_JUMBO_RETCOUNTER + $? ))
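This runs hdfs datanode with no arguments. A DataNode stores a set of blocks that hold the actual file data. It communicates with the NameNode, with other DataNodes, and with clients. A DataNode only maintains the mapping from blocks to byte streams.
DataNode initialization first sets up the metrics system and then enters the core code, the DataNode constructor: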
DataNode(final Configuration conf,
final List<StorageLocation> dataDirs,
final StorageLocationChecker storageLocationChecker,
final SecureResources resources) throws IOException {
super(conf);
this.tracer = createTracer(conf);
this.tracerConfigurationManager =
new TracerConfigurationManager(DATANODE_HTRACE_PREFIX, conf);
this.fileIoProvider = new FileIoProvider(conf, this);
this.blockScanner = new BlockScanner(this);
this.lastDiskErrorCheck = 0;
this.maxNumberOfBlocksToLog = conf.getLong(DFS_MAX_NUM_BLOCKS_TO_LOG_KEY,
DFS_MAX_NUM_BLOCKS_TO_LOG_DEFAULT);
this.usersWithLocalPathAccess = Arrays.asList(
conf.getTrimmedStrings(DFSConfigKeys.DFS_BLOCK_LOCAL_PATH_ACCESS_USER_KEY));
this.connectToDnViaHostname = conf.getBoolean(
DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME,
DFSConfigKeys.DFS_DATANODE_USE_DN_HOSTNAME_DEFAULT);
this.supergroup = conf.get(DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_KEY,
DFSConfigKeys.DFS_PERMISSIONS_SUPERUSERGROUP_DEFAULT);
this.isPermissionEnabled = conf.getBoolean(
DFSConfigKeys.DFS_PERMISSIONS_ENABLED_KEY,
DFSConfigKeys.DFS_PERMISSIONS_ENABLED_DEFAULT);
this.pipelineSupportECN = conf.getBoolean(
DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED,
DFSConfigKeys.DFS_PIPELINE_ECN_ENABLED_DEFAULT);
confVersion = "core-" +
conf.get("hadoop.common.configuration.version", "UNSPECIFIED") +
",hdfs-" +
conf.get("hadoop.hdfs.configuration.version", "UNSPECIFIED");
this.volumeChecker = new DatasetVolumeChecker(conf, new Timer());
// Determine whether we should try to pass file descriptors to clients.
if (conf.getBoolean(HdfsClientConfigKeys.Read.ShortCircuit.KEY,
HdfsClientConfigKeys.Read.ShortCircuit.DEFAULT)) {
String reason = DomainSocket.getLoadingFailureReason();
if (reason != null) {
LOG.warn("File descriptor passing is disabled because " + reason);
this.fileDescriptorPassingDisabledReason = reason;
} else {
LOG.info("File descriptor passing is enabled.");
this.fileDescriptorPassingDisabledReason = null;
}
} else {
this.fileDescriptorPassingDisabledReason =
"File descriptor passing was not configured.";
LOG.debug(this.fileDescriptorPassingDisabledReason);
}
this.socketFactory = NetUtils.getDefaultSocketFactory(conf);
try {
hostName = getHostName(conf);
LOG.info("Configured hostname is " + hostName);
startDataNode(dataDirs, resources);
} catch (IOException ie) {
shutdown();
throw ie;
}
final int dncCacheMaxSize =
conf.getInt(DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_KEY,
DFS_DATANODE_NETWORK_COUNTS_CACHE_MAX_SIZE_DEFAULT) ;
datanodeNetworkCounts =
CacheBuilder.newBuilder()
.maximumSize(dncCacheMaxSize)
.build(new CacheLoader<String, Map<String, Long>>() {
@Override
public Map<String, Long> load(String key) throws Exception {
final Map<String, Long> ret = new HashMap<String, Long>();
ret.put("networkErrors", 0L);
return ret;
}
});
initOOBTimeout();
this.storageLocationChecker = storageLocationChecker;
}
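The most important call in it is startDataNode. Its core steps are:
1. Register the MBean.
2. Create a TcpPeerServer listening on port 50010. This server handles communication with clients and with other DataNodes, and it does not use Hadoop's IPC mechanism.
3. Start the JvmPauseMonitor, which watches for JVM pauses and logs an entry when it detects one.
4. Initialize the IpcServer, listening on port 50020.
5. Construct a BPOfferService and start its BPServiceActor thread. BPServiceActor first handshakes with the NameNode as a pre-registration step, then registers the DataNode with the NameNode, and afterwards periodically sends heartbeats to the NameNode and processes the commands returned in the responses.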
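The pause detection in step 3 rests on a simple trick: sleep for a fixed interval and see how much longer than requested the sleep took. The sketch below shows that idea only. PauseMonitorSketch is hypothetical, and the 500 ms interval and the 1 s / 10 s thresholds are assumed values, not taken from the source above.

```java
// Hypothetical sketch of sleep-overshoot based pause detection.
final class PauseMonitorSketch {
    static final long SLEEP_MS = 500;             // assumed sampling interval
    static final long INFO_THRESHOLD_MS = 1_000;  // assumed threshold for an INFO entry
    static final long WARN_THRESHOLD_MS = 10_000; // assumed threshold for a WARN entry

    // Decide how a measured overshoot would be reported.
    static String classify(long extraMs) {
        if (extraMs > WARN_THRESHOLD_MS) {
            return "WARN";
        }
        if (extraMs > INFO_THRESHOLD_MS) {
            return "INFO";
        }
        return "OK";
    }

    // Sleep and return how many milliseconds longer than requested it took.
    static long measureExtraSleepMs(long sleepMs) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMs);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return elapsedMs - sleepMs;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 2; i++) {
            long extra = measureExtraSleepMs(SLEEP_MS);
            System.out.println("extra sleep " + extra + " ms -> " + classify(extra));
        }
    }
}
```

A long garbage collection or a stalled host stops the sleeping thread too, so the overshoot is a cheap proxy for the pause length.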
Step 5 corresponds to the following code:
public void run() {
LOG.info(this + " starting to offer service");
try {
while (true) {
// init stuff
try {
// setup storage
connectToNNAndHandshake();
break;
} catch (IOException ioe) {
// Initial handshake, storage recovery or registration failed
runningState = RunningState.INIT_FAILED;
if (shouldRetryInit()) {
// Retry until all namenode's of BPOS failed initialization
LOG.error("Initialization failed for " + this + " "
+ ioe.getLocalizedMessage());
sleepAndLogInterrupts(5000, "initializing");
} else {
runningState = RunningState.FAILED;
LOG.error("Initialization failed for " + this + ". Exiting. ", ioe);
return;
}
}
}
runningState = RunningState.RUNNING;
if (initialRegistrationComplete != null) {
initialRegistrationComplete.countDown();
}
while (shouldRun()) {
try {
offerService();
} catch (Exception ex) {
LOG.error("Exception in BPOfferService for " + this, ex);
sleepAndLogInterrupts(5000, "offering service");
}
}
runningState = RunningState.EXITED;
} catch (Throwable ex) {
LOG.warn("Unexpected exception in block pool " + this, ex);
runningState = RunningState.FAILED;
} finally {
LOG.warn("Ending block pool service for: " + this);
cleanUp();
}
}
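The first loop of run() is a retry-until-initialized pattern with a fixed pause between attempts. A stripped-down sketch of that pattern follows; RetryInitSketch is hypothetical, and the maxAttempts limit stands in for the shouldRetryInit() decision, which the code above does not show.

```java
import java.util.function.Supplier;

// Hypothetical sketch: retry an initialization step with a fixed pause between attempts.
final class RetryInitSketch {
    static <T> T retryUntilSuccess(Supplier<T> attempt, int maxAttempts, long sleepMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e;
                if (i < maxAttempts) {
                    pause(sleepMs);
                }
            }
        }
        // Every attempt failed: report the last failure to the caller.
        throw last;
    }

    private static void pause(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        String state = retryUntilSuccess(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("NameNode not reachable yet");
            }
            return "RUNNING";
        }, 5, 10);
        System.out.println(state + " after " + calls[0] + " attempts");
    }
}
```

The real loop sleeps 5000 ms between attempts, as the sleepAndLogInterrupts calls show; the sketch uses a short pause so it finishes quickly.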
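The BPServiceActor thread does the following:
1. Send a versionRequest to the NameNode to obtain its namespace and version information. The response is a NamespaceInfo.
2. Initialize storage from the NamespaceInfo, formatting it first. After initialization a UUID is generated, as these log lines show: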
2017-02-11 21:59:33,901 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Setting up storage: nsid=537369943;bpid=BP-503975772-192.168.0.109-1486821555429;lv=-56;nsInfo=lv=-60;cid=CID-c79cc043-b282-435c-a0f6-d5a55b23e87e;nsid=537369943;c=0;bpid=BP-503975772-192.168.0.109-1486821555429;dnuuid=null
2017-02-11 21:59:33,902 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Generated and persisted new Datanode UUID 43ed99d1-20c6-4d71-919c-e9a70cb75c6e
3. Do the actual handshake by sending a registerDatanode request to the NameNode. The NameNode handles the request by calling registerDatanode on the DatanodeManager, and its log then shows:
2017-02-11 21:59:34,090 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(127.0.0.1, datanodeUuid=43ed99d1-20c6-4d71-919c-e9a70cb75c6e, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-c79cc043-b282-435c-a0f6-d5a55b23e87e;nsid=537369943;c=0) storage 43ed99d1-20c6-4d71-919c-e9a70cb75c6e
2017-02-11 21:59:34,099 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2017-02-11 21:59:34,100 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
2017-02-11 21:59:34,189 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2017-02-11 21:59:34,189 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID DS-7d302778-acd6-4366-be5e-9dbf7ad22c4d for DN 127.0.0.1:50010
4. Call offerService, which starts sending heartbeats periodically. Each heartbeat carries the DataNode name, the data transfer port, the total capacity, and the remaining bytes. When the NameNode receives a heartbeat it runs handleHeartbeat.
At this point the NameNode and the DataNode are both working normally, and HDFS startup is complete.
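To picture the periodic heartbeat described in the last step, here is a small sketch with a scheduled task. It is not the DataNode's own code: HeartbeatSketch and its Heartbeat class are hypothetical, the fields simply follow the contents listed above, and the values are made up. The sketch prints instead of sending an RPC.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a periodic heartbeat; a real DataNode sends this over RPC.
final class HeartbeatSketch {
    static final class Heartbeat {
        final String datanodeName;
        final int transferPort;
        final long capacityBytes;
        final long remainingBytes;

        Heartbeat(String datanodeName, int transferPort, long capacityBytes, long remainingBytes) {
            this.datanodeName = datanodeName;
            this.transferPort = transferPort;
            this.capacityBytes = capacityBytes;
            this.remainingBytes = remainingBytes;
        }

        long usedBytes() {
            return capacityBytes - remainingBytes;
        }

        @Override
        public String toString() {
            return datanodeName + ":" + transferPort
                + " capacity=" + capacityBytes + " remaining=" + remainingBytes;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Runnable send = () ->
            System.out.println("heartbeat " + new Heartbeat("127.0.0.1", 50010, 1_000_000L, 400_000L));
        // Demo interval of 100 ms; dfs.heartbeat.interval defaults to 3 seconds.
        scheduler.scheduleAtFixedRate(send, 0, 100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        scheduler.shutdownNow();
    }
}
```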