Spark Platform (Advanced Edition, Part 5): Hadoop Installation

For the full table of contents, the platform overview, and the installation environment and versions, see 《Spark平台(高级版)概览》.

5. Hadoop 3: Big Data Foundation Components

First build the basic cluster environment. Once the whole cluster is up, you can work with it hands-on and come to understand each Hadoop feature, such as HDFS, MapReduce, and YARN. Later chapters then build on this base environment and add the new material each chapter introduces. This way you get your hands on a cluster earlier and understand it faster.

5.1 Setting Up the Basic Cluster Environment

5.1.1 Cluster Topology

This part describes how to set up the cluster environment. The logical deployment consists of two NameNodes (NN), two ResourceManagers (RM), and three DataNodes (DN).

The corresponding physical deployment is as follows: NameNode1 and ResourceManager1 are deployed on app-11, NameNode2 and ResourceManager2 on app-12, and the three JournalNodes and three DataNodes are deployed across app-11, app-12, and app-13.

5.1.2 Building the Cluster

    Start the app-11, app-12, and app-13 nodes.

5.1.2.1 The app-11 Node

5.1.2.1.1 Starting and Stopping ZooKeeper

Create scripts that operate on several nodes at once, so you do not have to log in to every node each time to start and stop ZooKeeper.

  • Connect to the app-11 node with ssh and switch to the hadoop user: su - hadoop
  • Change to the /hadoop/tools directory.
  • Upload the three files from the \安装资源3\Hadoop\tools directory to /hadoop/tools. These three files are the scripts for starting and stopping ZooKeeper, replacing manual startup on each node.

Make the start and stop scripts executable: chmod a+x *.sh

If ZooKeeper is already running, run the stop script first. The stop script can be run repeatedly without side effects; it is safe to run even when ZooKeeper is not running.

Script: /hadoop/tools/remoteSSHNOTroot.exp

#!/usr/local/bin/expect
####################################
# Name  : sshLoginTest.exp
# Desc   : auto switch to usr($argv 0) with password($argv 1)
#             and host($argv 2)
#             and execute cmd($argv 3)
# Use     : /usr/local/bin/expect sshLoginTest.exp root *** "pwd"
####################################
set user [ lindex $argv 0 ]
set passwd [ lindex $argv 1 ]
set host [ lindex $argv 2 ]
set cmd [ lindex $argv 3 ]
set timeout -1
spawn ssh $user@$host
expect {
    "continue connecting (yes/no)? " {
            send "yes\n";
            exp_continue;
    }
	"password: " {
            send "$passwd\n";
            exp_continue;
    }
    "*]# " { }
    "*]$ " { }
}

send "$cmd\n"
expect {
    "*]# " { }
    "*]$ " { }
}

send "exit\n"
expect {
    "*]# " { }
    "*]$ " { }
}
return 0;
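
For reference, a single invocation of this helper follows the pattern used in the scripts below; the password argument is the hadoop user's password, shown here only as a placeholder:

/hadoop/tools/expect/bin/expect /hadoop/tools/remoteSSHNOTroot.exp hadoop <password> app-11 "jps"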

Stop script: /hadoop/tools/stopZookeeper.sh

#!/bin/sh
# Stop ZooKeeper on every node; only the host name before the first ":" is used.
nodeArray="app-13:2888:3888 app-12:2888:3888 app-11:2888:3888 "
for node in $nodeArray
do
	t=$(echo $node | cut -d ":" -f1)
	/hadoop/tools/expect/bin/expect /hadoop/tools/remoteSSHNOTroot.exp hadoop Yhf_1018 $t "zkServer.sh stop"
done

Run the start script: /hadoop/tools/startZookeeper.sh

#!/bin/sh
# Start ZooKeeper on every node, then print its status.
nodeArray="app-13:2888:3888 app-12:2888:3888 app-11:2888:3888 "
for node in $nodeArray
do
	t=$(echo $node | cut -d ":" -f1)
	/hadoop/tools/expect/bin/expect /hadoop/tools/remoteSSHNOTroot.exp hadoop Yhf_1018 $t "zkServer.sh start"
	/hadoop/tools/expect/bin/expect /hadoop/tools/remoteSSHNOTroot.exp hadoop Yhf_1018 $t "zkServer.sh status"
done

Confirm that the startup succeeded:
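
One quick way to confirm is to look for the ZooKeeper daemon on each node (the start script above already prints zkServer.sh status as well); a minimal sketch, assuming the password-free ssh access between nodes that later steps also rely on:

# Each node should report one QuorumPeerMain process (the ZooKeeper daemon)
for name in app-11 app-12 app-13; do ssh hadoop@$name "jps | grep QuorumPeerMain"; done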

5.1.2.1.2 Installing Hadoop

Create the Hadoop installation directory: mkdir /hadoop/Hadoop

Upload the Hadoop installation package hadoop-3.1.2.tar.gz to the installation directory /hadoop/Hadoop.

Create a directory for the configuration files: mkdir conf

Upload the seven prepared Hadoop configuration files:

capacity-scheduler.xml: the YARN Capacity Scheduler configuration file (for details, see <Spark平台(高级版五)Hadoop_YARN>)

<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>
      Maximum number of applications that can be pending and running.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.5</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run 
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>
      The ResourceCalculator implementation to be used to compare 
      Resources in the scheduler.
      The default i.e. DefaultResourceCalculator only uses Memory while
      DominantResourceCalculator uses dominant-resource to compare 
      multi-dimensional resources such as Memory, CPU etc.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default</value>
    <description>
      The queues at the this level (root is the root queue).
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>100</value>
    <description>Default queue target capacity.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>1</value>
    <description>
      Default queue user limit a percentage from 0.0 to 1.0.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>100</value>
    <description>
      The maximum capacity of the default queue. 
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>
      The state of the default queue. State can be one of RUNNING or STOPPED.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>
      The ACL of who can submit jobs to the default queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>
      The ACL of who can administer jobs on the default queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_application_max_priority</name>
    <value>*</value>
    <description>
      The ACL of who can submit applications with configured priority.
      For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
    </description>
  </property>

   <property>
     <name>yarn.scheduler.capacity.root.default.maximum-application-lifetime
     </name>
     <value>-1</value>
     <description>
        Maximum lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        This will be a hard time limit for all applications in this
        queue. If positive value is configured then any application submitted
        to this queue will be killed after exceeds the configured lifetime.
        User can also specify lifetime per application basis in
        application submission context. But user lifetime will be
        overridden if it exceeds queue maximum lifetime. It is point-in-time
        configuration.
        Note : Configuring too low value will result in killing application
        sooner. This feature is applicable only for leaf queue.
     </description>
   </property>

   <property>
     <name>yarn.scheduler.capacity.root.default.default-application-lifetime
     </name>
     <value>-1</value>
     <description>
        Default lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        If the user has not submitted application with lifetime value then this
        value will be taken. It is point-in-time configuration.
        Note : Default lifetime can't exceed maximum lifetime. This feature is
        applicable only for leaf queue.
     </description>
   </property>

  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>
      Number of missed scheduling opportunities after which the CapacityScheduler 
      attempts to schedule rack-local containers.
      When setting this parameter, the size of the cluster should be taken into account.
      We use 40 as the default value, which is approximately the number of nodes in one rack.
      Note, if this value is -1, the locality constraint in the container request
      will be ignored, which disables the delay scheduling.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
    <value>-1</value>
    <description>
      Number of additional missed scheduling opportunities over the node-locality-delay
      ones, after which the CapacityScheduler attempts to schedule off-switch containers,
      instead of rack-local ones.
      Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
      attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
      after 40+20=60 missed opportunities.
      When setting this parameter, the size of the cluster should be taken into account.
      We use -1 as the default value, which disables this feature. In this case, the number
      of missed opportunities for assigning off-switch containers is calculated based on
      the number of containers and unique locations specified in the resource request,
      as well as the size of the cluster.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value></value>
    <description>
      A list of mappings that will be used to assign jobs to queues
      The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
      Typically this list will be used to map users to queues,
      for example, u:%user:%user maps all users to queues with the same name
      as the user.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>
      If a queue mapping is present, will it override the value specified
      by the user? This can be used by administrators to place jobs in queues
      that are different than the one specified by the user.
      The default is false.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
    <value>1</value>
    <description>
      Controls the number of OFF_SWITCH assignments allowed
      during a node's heartbeat. Increasing this value can improve
      scheduling rate for OFF_SWITCH containers. Lower values reduce
      "clumping" of applications on particular nodes. The default is 1.
      Legal values are 1-MAX_INT. This config is refreshable.
    </description>
  </property>


  <property>
    <name>yarn.scheduler.capacity.application.fail-fast</name>
    <value>false</value>
    <description>
      Whether RM should fail during recovery if previous applications'
      queue is no longer valid.
    </description>
  </property>

</configuration>

Modified part: yarn.scheduler.capacity.maximum-am-resource-percent is raised from the default 0.1 to 0.5:

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.5</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run 
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>

hadoop-env.sh: environment variables used by the Hadoop scripts so that Hadoop runs correctly (for details, see <Spark平台(高级版五)Hadoop_HDFS>)

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

##
## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE
## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.
##
## Precedence rules:
##
## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults
##
## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults
##

# Many of the options here are built from the perspective that users
# may want to provide OVERWRITING values on the command line.
# For example:
#
#  JAVA_HOME=/usr/java/testing hdfs dfs -ls
#
# Therefore, the vast majority (BUT NOT ALL!) of these defaults
# are configured for substitution and not append.  If append
# is preferable, modify this file accordingly.

###
# Generic settings for HADOOP
###

# Technically, the only required environment variable is JAVA_HOME.
# All others are optional.  However, the defaults are probably not
# preferred.  Many sites configure these options outside of Hadoop,
# such as in /etc/profile.d

# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/hadoop/JDK/jdk1.8.0_131

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.
export HADOOP_HOME=/hadoop/Hadoop/hadoop-3.1.2

# Location of Hadoop's configuration information.  i.e., where this
# file is living. If this is not defined, Hadoop will attempt to
# locate it based upon its execution path.
#
# NOTE: It is recommend that this variable not be set here but in
# /etc/profile.d or equivalent.  Some options (such as
# --config) may react strangely otherwise.
#
# export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop

# The maximum amount of heap to use (Java -Xmx).  If no unit
# is provided, it will be converted to MB.  Daemons will
# prefer any Xmx setting in their respective _OPT variable.
# There is no default; the JVM will autoscale based upon machine
# memory size.
# export HADOOP_HEAPSIZE_MAX=

# The minimum amount of heap to use (Java -Xms).  If no unit
# is provided, it will be converted to MB.  Daemons will
# prefer any Xms setting in their respective _OPT variable.
# There is no default; the JVM will autoscale based upon machine
# memory size.
# export HADOOP_HEAPSIZE_MIN=

# Enable extra debugging of Hadoop's JAAS binding, used to set up
# Kerberos security.
# export HADOOP_JAAS_DEBUG=true

# Extra Java runtime options for all Hadoop commands. We don't support
# IPv6 yet/still, so by default the preference is set to IPv4.
# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"
# For Kerberos debugging, an extended option set logs more invormation
# export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true -Dsun.security.krb5.debug=true -Dsun.security.spnego.debug"

# Some parts of the shell code may do special things dependent upon
# the operating system.  We have to set this here. See the next
# section as to why....
export HADOOP_OS_TYPE=${HADOOP_OS_TYPE:-$(uname -s)}

# Extra Java runtime options for some Hadoop commands
# and clients (i.e., hdfs dfs -blah).  These get appended to HADOOP_OPTS for
# such commands.  In most cases, # this should be left empty and
# let users supply it on the command line.
# export HADOOP_CLIENT_OPTS=""

#
# A note about classpaths.
#
# By default, Apache Hadoop overrides Java's CLASSPATH
# environment variable.  It is configured such
# that it sarts out blank with new entries added after passing
# a series of checks (file/dir exists, not already listed aka
# de-deduplication).  During de-depulication, wildcards and/or
# directories are *NOT* expanded to keep it simple. Therefore,
# if the computed classpath has two specific mentions of
# awesome-methods-1.0.jar, only the first one added will be seen.
# If two directories are in the classpath that both contain
# awesome-methods-1.0.jar, then Java will pick up both versions.

# An additional, custom CLASSPATH. Site-wide configs should be
# handled via the shellprofile functionality, utilizing the
# hadoop_add_classpath function for greater control and much
# harder for apps/end-users to accidentally override.
# Similarly, end users should utilize ${HOME}/.hadooprc .
# This variable should ideally only be used as a short-cut,
# interactive way for temporary additions on the command line.
# export HADOOP_CLASSPATH="/some/cool/path/on/your/machine"

# Should HADOOP_CLASSPATH be first in the official CLASSPATH?
# export HADOOP_USER_CLASSPATH_FIRST="yes"

# If HADOOP_USE_CLIENT_CLASSLOADER is set, the classpath along
# with the main jar are handled by a separate isolated
# client classloader when 'hadoop jar', 'yarn jar', or 'mapred job'
# is utilized. If it is set, HADOOP_CLASSPATH and
# HADOOP_USER_CLASSPATH_FIRST are ignored.
# export HADOOP_USE_CLIENT_CLASSLOADER=true

# HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES overrides the default definition of
# system classes for the client classloader when HADOOP_USE_CLIENT_CLASSLOADER
# is enabled. Names ending in '.' (period) are treated as package names, and
# names starting with a '-' are treated as negative matches. For example,
# export HADOOP_CLIENT_CLASSLOADER_SYSTEM_CLASSES="-org.apache.hadoop.UserClass,java.,javax.,org.apache.hadoop."

# Enable optional, bundled Hadoop features
# This is a comma delimited list.  It may NOT be overridden via .hadooprc
# Entries may be added/removed as needed.
# export HADOOP_OPTIONAL_TOOLS="hadoop-aliyun,hadoop-aws,hadoop-azure-datalake,hadoop-azure,hadoop-kafka,hadoop-openstack"

###
# Options for remote shell connectivity
###

# There are some optional components of hadoop that allow for
# command and control of remote hosts.  For example,
# start-dfs.sh will attempt to bring up all NNs, DNS, etc.

# Options to pass to SSH when one of the "log into a host and
# start/stop daemons" scripts is executed
# export HADOOP_SSH_OPTS="-o BatchMode=yes -o StrictHostKeyChecking=no -o ConnectTimeout=10s"

# The built-in ssh handler will limit itself to 10 simultaneous connections.
# For pdsh users, this sets the fanout size ( -f )
# Change this to increase/decrease as necessary.
# export HADOOP_SSH_PARALLEL=10

# Filename which contains all of the hosts for any remote execution
# helper scripts # such as workers.sh, start-dfs.sh, etc.
# export HADOOP_WORKERS="${HADOOP_CONF_DIR}/workers"

###
# Options for all daemons
###
#

#
# Many options may also be specified as Java properties.  It is
# very common, and in many cases, desirable, to hard-set these
# in daemon _OPTS variables.  Where applicable, the appropriate
# Java property is also identified.  Note that many are re-used
# or set differently in certain contexts (e.g., secure vs
# non-secure)
#

# Where (primarily) daemon log files are stored.
# ${HADOOP_HOME}/logs by default.
# Java property: hadoop.log.dir
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

# A string representing this instance of hadoop. $USER by default.
# This is used in writing log and pid files, so keep that in mind!
# Java property: hadoop.id.str
# export HADOOP_IDENT_STRING=$USER

# How many seconds to pause after stopping a daemon
# export HADOOP_STOP_TIMEOUT=5

# Where pid files are stored.  /tmp by default.
export HADOOP_PID_DIR=/hadoop/Hadoop/hadoop-3.1.2/tmp

# Default log4j setting for interactive commands
# Java property: hadoop.root.logger
# export HADOOP_ROOT_LOGGER=INFO,console

# Default log4j setting for daemons spawned explicitly by
# --daemon option of hadoop, hdfs, mapred and yarn command.
# Java property: hadoop.root.logger
# export HADOOP_DAEMON_ROOT_LOGGER=INFO,RFA

# Default log level and output location for security-related messages.
# You will almost certainly want to change this on a per-daemon basis via
# the Java property (i.e., -Dhadoop.security.logger=foo). (Note that the
# defaults for the NN and 2NN override this by default.)
# Java property: hadoop.security.logger
# export HADOOP_SECURITY_LOGGER=INFO,NullAppender

# Default process priority level
# Note that sub-processes will also run at this level!
# export HADOOP_NICENESS=0

# Default name for the service level authorization file
# Java property: hadoop.policy.file
# export HADOOP_POLICYFILE="hadoop-policy.xml"

#
# NOTE: this is not used by default!  <-----
# You can define variables right here and then re-use them later on.
# For example, it is common to use the same garbage collection settings
# for all the daemons.  So one could define:
#
# export HADOOP_GC_SETTINGS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps"
#
# .. and then use it as per the b option under the namenode.

###
# Secure/privileged execution
###

#
# Out of the box, Hadoop uses jsvc from Apache Commons to launch daemons
# on privileged ports.  This functionality can be replaced by providing
# custom functions.  See hadoop-functions.sh for more information.
#

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol.  Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
# export JSVC_HOME=/usr/bin

#
# This directory contains pids for secure and privileged processes.
#export HADOOP_SECURE_PID_DIR=${HADOOP_PID_DIR}

#
# This directory contains the logs for secure and privileged processes.
# Java property: hadoop.log.dir
# export HADOOP_SECURE_LOG=${HADOOP_LOG_DIR}

#
# When running a secure daemon, the default value of HADOOP_IDENT_STRING
# ends up being a bit bogus.  Therefore, by default, the code will
# replace HADOOP_IDENT_STRING with HADOOP_xx_SECURE_USER.  If one wants
# to keep HADOOP_IDENT_STRING untouched, then uncomment this line.
# export HADOOP_SECURE_IDENT_PRESERVE="true"

###
# NameNode specific parameters
###

# Default log level and output location for file system related change
# messages. For non-namenode daemons, the Java property must be set in
# the appropriate _OPTS if one wants something other than INFO,NullAppender
# Java property: hdfs.audit.logger
# export HDFS_AUDIT_LOGGER=INFO,NullAppender

# Specify the JVM options to be used when starting the NameNode.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# a) Set JMX options
# export HDFS_NAMENODE_OPTS="-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1026"
#
# b) Set garbage collection logs
# export HDFS_NAMENODE_OPTS="${HADOOP_GC_SETTINGS} -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"
#
# c) ... or set them directly
# export HDFS_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:${HADOOP_LOG_DIR}/gc-rm.log-$(date +'%Y%m%d%H%M')"

# this is the default:
# export HDFS_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"

###
# SecondaryNameNode specific parameters
###
# Specify the JVM options to be used when starting the SecondaryNameNode.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# This is the default:
# export HDFS_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=INFO,RFAS"

###
# DataNode specific parameters
###
# Specify the JVM options to be used when starting the DataNode.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# This is the default:
# export HDFS_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
# This will replace the hadoop.id.str Java property in secure mode.
# export HDFS_DATANODE_SECURE_USER=hdfs

# Supplemental options for secure datanodes
# By default, Hadoop uses jsvc which needs to know to launch a
# server jvm.
# export HDFS_DATANODE_SECURE_EXTRA_OPTS="-jvm server"

###
# NFS3 Gateway specific parameters
###
# Specify the JVM options to be used when starting the NFS3 Gateway.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HDFS_NFS3_OPTS=""

# Specify the JVM options to be used when starting the Hadoop portmapper.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HDFS_PORTMAP_OPTS="-Xmx512m"

# Supplemental options for priviliged gateways
# By default, Hadoop uses jsvc which needs to know to launch a
# server jvm.
# export HDFS_NFS3_SECURE_EXTRA_OPTS="-jvm server"

# On privileged gateways, user to run the gateway as after dropping privileges
# This will replace the hadoop.id.str Java property in secure mode.
# export HDFS_NFS3_SECURE_USER=nfsserver

###
# ZKFailoverController specific parameters
###
# Specify the JVM options to be used when starting the ZKFailoverController.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HDFS_ZKFC_OPTS=""

###
# QuorumJournalNode specific parameters
###
# Specify the JVM options to be used when starting the QuorumJournalNode.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HDFS_JOURNALNODE_OPTS=""

###
# HDFS Balancer specific parameters
###
# Specify the JVM options to be used when starting the HDFS Balancer.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HDFS_BALANCER_OPTS=""

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HDFS_MOVER_OPTS=""

###
# Router-based HDFS Federation specific parameters
# Specify the JVM options to be used when starting the RBF Routers.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HDFS_DFSROUTER_OPTS=""
###

###
# Advanced Users Only!
###

#
# When building Hadoop, one can add the class paths to the commands
# via this special env var:
# export HADOOP_ENABLE_BUILD_PATHS="true"

#
# To prevent accidents, shell commands be (superficially) locked
# to only allow certain users to execute certain subcommands.
# It uses the format of (command)_(subcommand)_USER.
#
# For example, to limit who can execute the namenode command,
# export HDFS_NAMENODE_USER=hdfs

Adjusted parts: the JDK installation path and the Hadoop-related paths:

# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/hadoop/JDK/jdk1.8.0_131

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.
export HADOOP_HOME=/hadoop/Hadoop/hadoop-3.1.2

# Where (primarily) daemon log files are stored.
# ${HADOOP_HOME}/logs by default.
# Java property: hadoop.log.dir
export HADOOP_LOG_DIR=${HADOOP_HOME}/logs

# Where pid files are stored.  /tmp by default.
export HADOOP_PID_DIR=/hadoop/Hadoop/hadoop-3.1.2/tmp

core-site.xml: core configuration, including common I/O settings for HDFS, MapReduce, and YARN (all newly added; for details, see <Spark平台(高级版五)Hadoop_HDFS>)

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://dmcluster</value>
	</property>
	<property>
		<name>ha.zookeeper.quorum</name>
		<value>app-11:2181,app-12:2181,app-13:2181</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/hadoop/Hadoop/hadoop-3.1.2/tmp</value>
	</property>
    <property><name>hadoop.proxyuser.hadoop.hosts</name><value>*</value></property>
    <property><name>hadoop.proxyuser.hadoop.groups</name><value>*</value></property>
</configuration>

hdfs-site.xml: configuration for the HDFS daemons, including the namenode, secondary namenode, and datanodes (all newly added; for details, see <Spark平台(高级版五)Hadoop_HDFS>)

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>dfs.nameservices</name>
		<value>dmcluster</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/hadoop/Hadoop/hadoop-3.1.2/hdfs/name</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/hadoop/Hadoop/hadoop-3.1.2/hdfs/data</value>
	</property>
	<property>
		<name>dfs.ha.namenodes.dmcluster</name>
		<value>nn1,nn2</value>
	</property>
	<property>
		<name>dfs.namenode.shared.edits.dir</name>
		<value>qjournal://app-11:8485;app-12:8485;app-13:8485/dmcluster</value>
	</property>
	<property>
		<name>dfs.client.failover.proxy.provider.dmcluster</name>
		<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
	</property>
    <property>
		<name>dfs.ha.fencing.methods</name>
		<value>shell(/bin/true)</value>
    </property>
	<property>
		<name>dfs.ha.automatic-failover.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>dfs.journalnode.edits.dir</name>
		<value>/hadoop/Hadoop/hadoop-3.1.2/data/journals</value>
	</property>
	
	<!--
	<property>
		<name>dfs.namenode.rpc-address.dmcluster.nn1</name>
		<value>app-11:8020</value>
	</property>
	<property>
		<name>dfs.namenode.rpc-address.dmcluster.nn2</name>
		<value>app-12:8020</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.dmcluster.nn1</name>
		<value>app-11:9870</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.dmcluster.nn2</name>
		<value>app-12:9870</value>
	</property>
	-->
	<property>
		<name>dfs.namenode.rpc-address.dmcluster.nn1</name>
		<value>app-12:8020</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.dmcluster.nn1</name>
		<value>app-12:9870</value>
	</property>
	<property>
		<name>dfs.namenode.rpc-address.dmcluster.nn2</name>
		<value>app-11:8020</value>
	</property>
	<property>
		<name>dfs.namenode.http-address.dmcluster.nn2</name>
		<value>app-11:9870</value>
	</property>
</configuration>

mapred-site.xml: configuration for the MapReduce daemons, including the job history server (all newly added; for details, see <Spark平台(高级版六)Tez>)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
	
	<property>
		 <name>mapreduce.application.classpath</name>
		 <value>
		  /hadoop/Hadoop/hadoop-3.1.2/etc/hadoop,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/common/*,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/common/lib/*,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/hdfs/*,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/hdfs/lib/*,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/mapreduce/*,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/mapreduce/lib/*,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/yarn/*,
		  /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/yarn/lib/*
		 </value>
	 </property>
	<property>
		<name>yarn.app.mapreduce.am.staging-dir</name>
		<value>/hadoop/Hadoop/hadoop-3.1.2/tmp/hadoop-yarn/staging</value>
	</property>
	<!--history web address-->
	<property>
		<name>mapreduce.jobhistory.address</name>
		<value>app-12:10020</value>
	</property>
	<property>
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>app-12:19888</value>
	</property>
</configuration> 

yarn-site.xml: configuration for the YARN daemons, including the resource manager, the web application proxy server, and the node managers (all newly added; for details, see <Spark平台(高级版五)Hadoop_YARN>)

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
	<property>  
		<name>yarn.nodemanager.aux-services</name>  
		<value>mapreduce_shuffle</value>  
	</property>  
	<property>  
		<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>  
		<value>org.apache.hadoop.mapred.ShuffleHandler</value>
	</property>
  <property>
    <name>yarn.node-labels.fs-store.root-dir</name>
    <value>/hadoop/Hadoop/hadoop-3.1.2/tmp/hadoop-yarn-${user}/node-labels/</value>
  </property>
  <property>
    <name>yarn.node-attribute.fs-store.root-dir</name>
    <value>/hadoop/Hadoop/hadoop-3.1.2/tmp/hadoop-yarn-${user}/node-attribute/</value>
  </property>
  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/hadoop/Hadoop/hadoop-3.1.2/tmp/logs</value>
  </property>
  <property>
    <name>yarn.timeline-service.entity-group-fs-store.active-dir</name>
    <value>/hadoop/Hadoop/hadoop-3.1.2/tmp/entity-file-history/active</value>
  </property>
  <property>
    <name>yarn.timeline-service.entity-group-fs-store.done-dir</name>
    <value>/hadoop/Hadoop/hadoop-3.1.2/tmp/entity-file-history/done/</value>
  </property>
<!--HA configure-->
	<property>
		<name>yarn.resourcemanager.ha.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>yarn.resourcemanager.cluster-id</name>
		<value>rmCluster</value>
	</property>
	<property>
		<name>yarn.resourcemanager.ha.rm-ids</name>
		<value>rm1,rm2</value>
	</property>
	<property>
		<name>hadoop.zk.address</name>
		<value>app-11:2181,app-12:2181,app-13:2181</value>
	</property>
	<property>
		<name>yarn.nodemanager.vmem-check-enabled</name>
		<value>false</value>
	</property>
	<!--
	<property>
	  <name>yarn.resourcemanager.hostname.rm1</name>
	  <value>master1</value>
	</property>
	<property>
	  <name>yarn.resourcemanager.hostname.rm2</name>
	  <value>master2</value>
	</property>
	<property>
	  <name>yarn.resourcemanager.webapp.address.rm1</name>
	  <value>master1:8088</value>
	</property>
	<property>
	  <name>yarn.resourcemanager.webapp.address.rm2</name>
	  <value>master2:8088</value>
	</property>
	-->
	<property>
		<name>yarn.resourcemanager.hostname.rm1</name>
		<value>app-11</value>
	</property>
	<property>
		<name>yarn.resourcemanager.webapp.address.rm1</name>
		<value>app-11:8088</value>
	</property>
	<property>
		<name>yarn.resourcemanager.hostname.rm2</name>
		<value>app-12</value>
	</property>
	<property>
		<name>yarn.resourcemanager.webapp.address.rm2</name>
		<value>app-12:8088</value>
	</property>
</configuration>

workers: the list of worker nodes (all newly added; for details, see <Spark平台(高级版五)Hadoop_HDFS>)

app-11
app-12
app-13

Extract the installation package hadoop-3.1.2.tar.gz: tar -xf hadoop-3.1.2.tar.gz

After extraction the files end up in the directory hadoop-3.1.2 by default.

Look at the default configuration files that ship with Hadoop: hadoop-3.1.2/etc/hadoop/

Copy the uploaded configuration files into Hadoop's configuration directory, overwriting the defaults.

Confirm that the files were overwritten by inspecting one of them: cat hadoop-3.1.2/etc/hadoop/workers
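
Put together, the extract-and-overwrite step looks roughly like this, assuming the seven configuration files were uploaded into /hadoop/Hadoop/conf (the conf directory created above):

cd /hadoop/Hadoop
tar -xf hadoop-3.1.2.tar.gz
# Overwrite the shipped defaults with the prepared configuration files
cp -f conf/* hadoop-3.1.2/etc/hadoop/
cat hadoop-3.1.2/etc/hadoop/workers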

5.1.2.1.3 Environment Variables

Add the Hadoop home directory and all of its bin directories to the environment variables: vi ~/.bashrc

export HADOOP_HOME=/hadoop/Hadoop/hadoop-3.1.2
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${HADOOP_HOME}/lib:$PATH

Make the modified environment variables take effect: source ~/.bashrc
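
To verify that the variables are picked up, a quick check might be:

# hdfs should resolve to ${HADOOP_HOME}/bin/hdfs, and hadoop version should report 3.1.2
which hdfs
hadoop version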

5.1.2.2 The app-12 Node

From the app-11 node, create the directory on app-12:

Command: ssh hadoop@app-12 "mkdir /hadoop/Hadoop"

Copy the environment variables from the app-11 node to app-12:

Command: scp ~/.bashrc hadoop@app-12:~/

Copy the installation directory from the app-11 node to app-12:

Command: scp -r -q hadoop-3.1.2 hadoop@app-12:/hadoop/Hadoop/

Make the modified environment variables take effect (on app-12): source ~/.bashrc

5.1.2.3 The app-13 Node

From the app-11 node, create the directory on app-13:

Command: ssh hadoop@app-13 "mkdir /hadoop/Hadoop"

Copy the environment variables from the app-11 node to app-13:

Command: scp ~/.bashrc hadoop@app-13:~/

Copy the installation directory from the app-11 node to app-13:

Command: scp -r -q hadoop-3.1.2 hadoop@app-13:/hadoop/Hadoop/

Make the modified environment variables take effect (on app-13): source ~/.bashrc

5.1.2.4 Starting the JournalNodes

From the app-11 node, use a loop to log in to each node and start all the JournalNodes:

Command: for name in app-11 app-12 app-13; do ssh hadoop@$name "hdfs --daemon start journalnode"; done

The tmp and logs directories were not created manually; they are created automatically during startup.

Use the jps command to check whether the JournalNodes have started:

Command: for name in app-11 app-12 app-13; do ssh hadoop@$name "jps"; done

You should see three JournalNode daemons and three ZooKeeper daemons.
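
If the full jps listings are noisy, the two daemon types can be filtered directly; a minimal sketch (JournalNode and QuorumPeerMain are the process names jps reports for the JournalNode and ZooKeeper daemons):

for name in app-11 app-12 app-13; do
    echo "== $name =="
    ssh hadoop@$name "jps | grep -E 'JournalNode|QuorumPeerMain'"
done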

5.1.2.5 NameNode HA

Deploy NameNodes on the app-11 and app-12 nodes and set up HA between them.

5.1.2.5.1 app-11

On the app-11 node, format the file system. The JournalNodes must be running before formatting.

Command: hdfs namenode -format

Stop the JournalNodes:

Command: for name in app-11 app-12 app-13; do ssh hadoop@$name "hdfs --daemon stop journalnode"; done
Command: for name in app-11 app-12 app-13; do ssh hadoop@$name "jps"; done

You can see that the JournalNodes on all three nodes have shut down.

Next, create the ZooKeeper directory tree used for HA; the active and standby NameNodes coordinate with each other through this tree.

First, look at the ZooKeeper root directory.

Initialize ZooKeeper: hdfs zkfc -formatZK -force

This generates the root znode hadoop-ha.

Log in with the ZooKeeper client: zkCli.sh

Under the root there is now a hadoop-ha znode with a child dmcluster. The dmcluster znode represents the HA cluster being created; the cluster has two NameNodes, on app-11 and app-12. This comes from the configuration files and is covered in later chapters.

Exit.
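
For reference, the same check as a short zkCli.sh session; ls / should now list hadoop-ha, and ls /hadoop-ha should list dmcluster:

zkCli.sh
# inside the zkCli shell:
ls /
ls /hadoop-ha
quit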

5.1.2.5.2 app-12

Synchronize the NameNode initialization data from app-11 to the app-12 node.

Start the whole DFS system on app-11: start-dfs.sh

It starts the NameNodes first, then the DataNodes, and in the last two steps starts the JournalNodes and the ZooKeeper failover controller (ZKFC) daemons used for HA.

Check with jps.

You can see JournalNodes on three nodes and the two HA NameNodes.

Synchronize the NameNode data from app-11 to the NameNode on app-12, which has not been formatted:

Command: ssh hadoop@app-12 "hdfs namenode -bootstrapStandby"

Shut down DFS; the whole cluster will be started later through a different path: stop-dfs.sh

Check with jps.

5.1.3 Starting the Cluster

First confirm that ZooKeeper is running; if it is not, start ZooKeeper first.

Start the whole cluster: start-all.sh

The startup takes a while; it covers HDFS, the MapReduce-related services, and the YARN-related services.

Note: if ZooKeeper has not been started, start it first.

Check the processes: jps

You can see that the processes on each node match what was printed during startup.

Compared with starting DFS alone, this startup includes two extra steps: HDFS HA and YARN HA.

Check whether HDFS HA succeeded: hdfs haadmin -getAllServiceState

Check whether YARN HA succeeded: yarn rmadmin -getAllServiceState
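
The state of each instance can also be queried individually, using the service IDs defined in hdfs-site.xml (nn1, nn2) and yarn-site.xml (rm1, rm2); in a healthy setup one of each pair reports active and the other standby:

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2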

5.1.4 Testing

View the contents of HDFS: hdfs dfs -ls /

There is no output because nothing has been created in HDFS yet apart from the root directory.

Create the /test/data directory: hdfs dfs -mkdir -p /test/data

Copy all the data that MapReduce will process into the data directory.

Upload all the Hadoop configuration files to the data directory:

Command: hdfs dfs -put /hadoop/Hadoop/hadoop-3.1.2/etc/hadoop/*.xml /test/data

Submit a MapReduce job with the hadoop command.

The job uses Hadoop's bundled example jar hadoop-mapreduce-examples-3.1.2.jar and its grep program. The input is the files under /test/data and the output goes to /test/output. Note that the output directory must not already exist; the system creates it automatically, and the job fails if it does exist (see the sketch after the parameter list below).

Search pattern: the regular expression matches strings that start with dfs followed by lowercase letters and dots.

hadoop jar /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep /test/data /test/output "dfs[a-z.]+"

Example jar: /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar

  • Input: /test/data
  • Output: /test/output
  • Operation: grep
  • Pattern: dfs[a-z.]+
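
If the example is run again, the previous output directory has to be removed first, otherwise the job aborts; a minimal sketch:

# Remove the old output before rerunning; the grep job recreates /test/output itself
hdfs dfs -rm -r /test/output
hadoop jar /hadoop/Hadoop/hadoop-3.1.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep /test/data /test/output "dfs[a-z.]+"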

The job runs in two steps: a grep step that finds the lines matching the pattern, followed by a sort step.

Job job_1561093521861_0001 completed successfully

Job job_1561093521861_0002 completed successfully

List the result files: hdfs dfs -ls /test/output

View the results: hdfs dfs -cat /test/output/part-r-00000

The first column is the number of occurrences and the second column is the matched string.

5.1.5 Web UI

Modify the hosts file as described in Appendix A:

Open app-12:8088 in a browser and you can see the application.

You can see the two steps: one grep-search job and one grep-sort job.

Click "Logs".

Click syslog to view the system log.

5.1.6 Shutting Down the Cluster

Shut down Hadoop: stop-all.sh

Check the status: jps

Only ZooKeeper is left.

Shut down ZooKeeper: ./stopZookeeper.sh

Check the status: jps

Shutdown complete.
