For the full table of contents, platform overview, installation environment, and versions, see "Spark Platform (Condensed Edition) Overview".
6. Spark Python
6.1 Installing and starting Scala
Download scala-2.11.6.tgz from https://www.scala-lang.org/files/archive/
Extract: tar xvf scala-2.11.6.tgz
Move to the install directory: sudo mv scala-2.11.6 /usr/local/scala
Edit the environment variables: sudo gedit ~/.bashrc
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
Apply the new variables: source ~/.bashrc
Check the installation: scala -version
6.2 Installing Spark
Download from: http://archive.apache.org/dist/spark/spark-2.0.0/
Spark build: spark-2.0.0-bin-hadoop2.6.tgz
Extract: tar xvf spark-2.0.0-bin-hadoop2.6.tgz
Move to the install directory: sudo mv spark-2.0.0-bin-hadoop2.6 /usr/local/spark
Edit the environment variables: sudo gedit ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Apply the new variables: source ~/.bashrc
6.3 Starting pyspark
Command: pyspark
Quit: exit()
cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
Edit: sudo gedit log4j.properties
Start: pyspark
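The log4j.properties file copied above is usually edited to quiet the INFO-level log messages that flood the pyspark shell; a minimal sketch of that edit (the intent is assumed here, since the text does not say which line to change):

```
# /usr/local/spark/conf/log4j.properties (fragment)
# Change the default console log level from INFO to WARN:
log4j.rootCategory=WARN, console
```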
6.4 Testing with a text file
ll ~/wordcount/input
Start the (Hadoop) cluster: start-all.sh
6.4.1 Running pyspark locally and against HDFS
pyspark --master local[4]
sc.master
Read a local file: /usr/local/spark/README.md
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
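What textFile(...).count() does can be sketched in plain Python: textFile yields one RDD element per line of the file, and count() returns the number of elements. The sample string below is a hypothetical stand-in for README.md:

```python
# Plain-Python sketch of sc.textFile(...).count() semantics.
sample = "# Apache Spark\n\nSpark is a fast engine.\n"  # hypothetical file contents

lines = sample.splitlines()  # textFile() produces one element per line
num_lines = len(lines)       # count() returns the number of elements

print(num_lines)  # 3 (blank lines are counted too)
```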
Read an HDFS file: hdfs://master:9000/user/*/wordcount/input/LICENSE.txt
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
If HDFS is stuck in safe mode, leave it: hadoop dfsadmin -safemode leave
Troubleshooting
Copy and upload a fresh input file; do not reuse the file from the earlier WordCount.java experiment.
Re-run:
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
Path reference:
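Given that the input directory is named wordcount, the natural follow-up to count() is the classic word count pipeline, textFile(...).flatMap(...).map(...).reduceByKey(...). A plain-Python sketch of what that pipeline computes (the sample lines are a hypothetical stand-in for LICENSE.txt):

```python
from collections import Counter

# Stand-in for the lines of the input file:
sample_lines = ["apache license", "license text", "apache"]

words = [w for line in sample_lines for w in line.split()]  # flatMap(lambda l: l.split())
pairs = [(w, 1) for w in words]                             # map(lambda w: (w, 1))

counts = Counter()                                          # reduceByKey(lambda a, b: a + b)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'apache': 2, 'license': 2, 'text': 1}
```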
6.4.2 Running pyspark on YARN
hduser@master:~$ HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
sc.master
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
6.4.3 Viewing in the web UI
6.5 The Spark standalone cluster environment
6.5.1 Setup
/usr/local/spark/conf
hduser@master:~$ cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
sudo gedit /usr/local/spark/conf/spark-env.sh
export SPARK_MASTER_IP=master
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
Note: the worker memory setting must be greater than roughly 470 MB, or executors will fail to start (see the failure note in 6.5.6).
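The settings above determine the standalone cluster's total capacity; assuming only the three data nodes configured in 6.5.2-6.5.4 run workers, the totals work out as follows:

```python
# Cluster capacity implied by the spark-env.sh settings above,
# assuming the slaves file lists only data1, data2, data3:
workers_per_node = 4      # SPARK_WORKER_INSTANCES
cores_per_worker = 1      # SPARK_WORKER_CORES
mem_per_worker_mb = 512   # SPARK_WORKER_MEMORY
worker_nodes = 3          # data1, data2, data3

total_workers = worker_nodes * workers_per_node
total_cores = total_workers * cores_per_worker
total_mem_mb = total_workers * mem_per_worker_mb

print(total_workers, total_cores, total_mem_mb)  # 12 12 6144
```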
6.5.2 Copying to the data1 node
ssh data1
sudo mkdir /usr/local/spark
Set ownership: sudo chown hduser:hduser /usr/local/spark
Exit: exit
Remote copy: sudo scp -r /usr/local/spark hduser@data1:/usr/local
(Copy to /usr/local, not /usr/local/spark, so the files land in the directory created above rather than nested one level deeper.)
6.5.3 Copying to the data2 node
ssh data2
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data2:/usr/local
6.5.4 Copying to the data3 node
ssh data3
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data3:/usr/local
6.5.5 Editing the slaves file
ll /usr/local/spark/conf
cp /usr/local/spark/conf/slaves.template /usr/local/spark/conf/slaves
Edit: sudo gedit /usr/local/spark/conf/slaves
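The slaves file lists one worker host per line; given the three data nodes prepared in 6.5.2-6.5.4, its contents are presumably (an assumption, since the document does not show the file):

```
data1
data2
data3
```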
6.5.6 Running
Start the Spark cluster: /usr/local/spark/sbin/start-all.sh (use the full path, to avoid confusion with Hadoop's start-all.sh)
hduser@master:~$ pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
sc.master
Local file:
textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
HDFS file:
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
Failure note: the executor memory cannot be set below roughly 470 MB.
6.5.7 Web UI
In a browser, open: http://192.168.0.50:8080 (the standalone master's web UI)