Spark Platform (Condensed Edition, Part 5): Spark Python

For the full table of contents, platform overview, installation environment, and software versions, see "Spark Platform (Condensed Edition) Overview".

6. Spark Python

6.1 Installing and Starting Scala

Download scala-2.11.6.tgz from https://www.scala-lang.org/files/archive/

Extract: tar xvf scala-2.11.6.tgz
Move to the install directory: sudo mv scala-2.11.6 /usr/local/scala
Edit the environment variables: sudo gedit ~/.bashrc
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
Apply the changes: source ~/.bashrc

Verify the installation: scala -version

6.2 Installing Spark

Download from: http://archive.apache.org/dist/spark/spark-2.0.0/

Spark package: spark-2.0.0-bin-hadoop2.6.tgz

Extract: tar xvf spark-2.0.0-bin-hadoop2.6.tgz
Move to the install directory: sudo mv spark-2.0.0-bin-hadoop2.6 /usr/local/spark
Edit the environment variables: sudo gedit ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Apply the changes: source ~/.bashrc
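The two export lines simply record Spark's location and append its bin/ directory to the shell's search path. A minimal Python sketch of the same string manipulation (the starting PATH value here is just an example):

```python
import os

# Mirror of the ~/.bashrc edits: set SPARK_HOME and append its bin/ to PATH.
spark_home = "/usr/local/spark"
old_path = "/usr/bin:/bin"  # example starting PATH

env = {
    "SPARK_HOME": spark_home,
    "PATH": old_path + os.pathsep + os.path.join(spark_home, "bin"),
}

print(env["PATH"])  # on Linux: /usr/bin:/bin:/usr/local/spark/bin
```

After source ~/.bashrc, the shell resolves pyspark and spark-submit through this extended PATH.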

6.3 Starting pyspark

Start: pyspark
Quit: exit()
cd /usr/local/spark/conf
cp log4j.properties.template log4j.properties
Edit: sudo gedit log4j.properties (change log4j.rootCategory from INFO to WARN to cut down console noise)
Restart: pyspark
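The point of copying and editing log4j.properties is to lower the root logger level from INFO to WARN so the pyspark shell prints far fewer messages. A hypothetical Python helper that performs the same one-line edit on the file's text (this mirrors the manual gedit step; it is not a Spark tool):

```python
def quiet_log4j(text: str) -> str:
    """Replace the root logger level INFO with WARN, as in the manual edit."""
    return text.replace("log4j.rootCategory=INFO, console",
                        "log4j.rootCategory=WARN, console")

# The relevant line from log4j.properties.template in Spark 2.0:
template_line = "log4j.rootCategory=INFO, console"
print(quiet_log4j(template_line))  # log4j.rootCategory=WARN, console
```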

6.4 Test Text Files

ll ~/wordcount/input
Start the Hadoop cluster: start-all.sh

6.4.1 Running pyspark Locally and on HDFS

pyspark --master local[4]
sc.master

Read a local file: /usr/local/spark/README.md

textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
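sc.textFile() produces an RDD with one element per line of the file, so count() returns the file's line count. A plain-Python sketch of the same computation, runnable without Spark (the sample text is illustrative; the word-count tail shows the classic flatMap/map/reduceByKey result this tutorial's wordcount directory hints at, although only count() appears in the session above):

```python
from collections import Counter

# Stand-in for the file contents that sc.textFile() would read.
sample = "# Apache Spark\n\nSpark is a fast engine.\n"

lines = sample.splitlines()   # textFile() yields one record per line
print(len(lines))             # 3 -- the value count() would return here

# Equivalent of flatMap(split) + map((w, 1)) + reduceByKey(add):
words = Counter(w for line in lines for w in line.split())
print(words["Spark"])         # 2
```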

Read an HDFS file: hdfs://master:9000/user/*/wordcount/input/LICENSE.txt

textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
Leave safe mode if necessary: hdfs dfsadmin -safemode leave

Troubleshooting

If the read fails, copy and upload a fresh file instead of reusing the one from the earlier WordCount.java experiment.

Re-run:

textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()

The path can be confirmed with: hadoop fs -ls /user/hduser/wordcount/input

6.4.2 Running pyspark on YARN

hduser@master:~$ HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
sc.master
textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")

textFile.count()

6.4.3 Viewing in the Web UI

Open the YARN ResourceManager web UI (by default http://master:8088) to see the running pyspark application.

6.5 The Spark Standalone Cluster Environment

6.5.1 Setup

/usr/local/spark/conf
hduser@master:~$ cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh
sudo gedit /usr/local/spark/conf/spark-env.sh
export SPARK_MASTER_IP=master
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
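With SPARK_WORKER_INSTANCES=4, each worker node launches four worker processes, each offering 1 core and 512 MB. A quick arithmetic check of the total capacity this implies, assuming the three worker nodes data1, data2, and data3 that this cluster uses:

```python
# Cluster capacity implied by spark-env.sh, assuming 3 worker nodes.
worker_nodes = 3
instances_per_node = 4      # SPARK_WORKER_INSTANCES
cores_per_instance = 1      # SPARK_WORKER_CORES
mem_per_instance_mb = 512   # SPARK_WORKER_MEMORY

total_cores = worker_nodes * instances_per_node * cores_per_instance
total_mem_mb = worker_nodes * instances_per_node * mem_per_instance_mb
print(total_cores, total_mem_mb)  # 12 6144
```

So the standalone cluster advertises 12 cores and 6 GB of worker memory in total, which matches what the master's web UI at port 8080 should report.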

Note: the worker memory setting must be greater than about 470 MB.

6.5.2 Copying to the data1 Node

ssh data1
sudo mkdir /usr/local/spark
Set ownership: sudo chown hduser:hduser /usr/local/spark
Return to master: exit
Remote copy: sudo scp -r /usr/local/spark hduser@data1:/usr/local

6.5.3 Copying to the data2 Node

ssh data2
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data2:/usr/local

6.5.4 Copying to the data3 Node

ssh data3
sudo mkdir /usr/local/spark
sudo chown hduser:hduser /usr/local/spark
exit
sudo scp -r /usr/local/spark hduser@data3:/usr/local

6.5.5 Editing the slaves File

ll /usr/local/spark/conf
cp /usr/local/spark/conf/slaves.template /usr/local/spark/conf/slaves
Edit: sudo gedit /usr/local/spark/conf/slaves (list the worker hostnames data1, data2, and data3, one per line)

6.5.6 Running

Start the Spark cluster: /usr/local/spark/sbin/start-all.sh
hduser@master:~$ pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
sc.master

Local file:

textFile=sc.textFile("file:/usr/local/spark/README.md")
textFile.count()

HDFS file:

textFile=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()

Troubleshooting: the executor memory must not be set below roughly 470 MB.
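Spark rejects executors whose memory is below an internal floor, which is why this note warns against going much under 512 MB. A hypothetical checker for the --executor-memory argument (the parse function and the 470 MB threshold are illustrative, taken from the note above rather than from Spark's source):

```python
def parse_mb(size: str) -> int:
    """Parse a Spark memory string like '512m' or '1g' into megabytes."""
    size = size.lower().strip()
    if size.endswith("g"):
        return int(size[:-1]) * 1024
    if size.endswith("m"):
        return int(size[:-1])
    raise ValueError("expected a size like 512m or 1g")

MIN_MB = 471  # floor implied by the note above: must exceed ~470 MB

for arg in ("512m", "1g"):
    mb = parse_mb(arg)
    print(arg, mb, mb >= MIN_MB)  # both settings clear the floor
```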

6.5.7 Web UI

In a browser, open: http://192.168.0.50:8080
