
This post covers how to install Apache Spark on Ubuntu 19.04/18.04 and Debian 10/9/8.

Before installing, update the system packages.

(AnnaM) founder@hilbert:~$ sudo apt -y upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done

Step 1: Install Java


Apache Spark requires Java. Check which version of Java is installed with java -version; if none is installed, install one as follows. For more on installing Java on Ubuntu, see the following post.

https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-on-ubuntu-18-04

 


(AnnaM) founder@hilbert:~$ sudo apt install default-jdk

Check that Java was installed correctly.

(AnnaM) founder@hilbert:~$ java -version
openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3)
OpenJDK 64-Bit Server VM (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3, mixed mode, sharing)


Step 2: Download Apache Spark


At the time of writing, the latest version is 2.4.4. Download it as follows.

(AnnaM) founder@hilbert:~/annam$ curl -O https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  219M  100  219M    0     0  12.7M      0  0:00:17  0:00:17 --:--:-- 14.7M
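
Before using the archive, it is worth verifying its integrity. You can compute the hash locally and compare it against the checksum Apache publishes for the release (typically a spark-2.4.4-bin-hadoop2.7.tgz.sha512 file on the download page; the exact file name is an assumption here, so check the site).

(AnnaM) founder@hilbert:~/annam$ sha512sum spark-2.4.4-bin-hadoop2.7.tgz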

Extract the downloaded Spark archive.

(AnnaM) founder@hilbert:~/annam$ tar xvf spark-2.4.4-bin-hadoop2.7.tgz
spark-2.4.4-bin-hadoop2.7/

Move the extracted Spark directory to /opt/spark.

(AnnaM) founder@hilbert:~/annam$ sudo mv spark-2.4.4-bin-hadoop2.7/ /opt/spark

Set up the Spark environment: open the .bashrc configuration file and add the following lines.

(AnnaM) founder@hilbert:~/annam$ vim ~/.bashrc
export SPARK_HOME=/opt/spark 
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply the changes.

(AnnaM) founder@hilbert:~/annam$ source ~/.bashrc
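
As a quick sanity check, you can confirm that the variable resolves and that the Spark binaries are now picked up from the PATH:

(AnnaM) founder@hilbert:~/annam$ echo $SPARK_HOME
/opt/spark
(AnnaM) founder@hilbert:~/annam$ which spark-shell
/opt/spark/bin/spark-shell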

 

Step 3: Start a standalone master server


You can start a standalone master server with the start-master.sh command, as shown below.

(AnnaM) founder@hilbert:~/annam$ start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-founder-org.apache.spark.deploy.master.Master-1-hilbert.out

The process listens on TCP port 8080.

(AnnaM) founder@hilbert:~/annam$ ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=5029,fd=246)) uid:1001 ino:35961 sk:d v6only:0 <->

Port 8080 serves the master's web UI, which you can open in a browser.

The Spark URL is spark://**********:7077.
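
On a headless server you can confirm the web UI is responding without a browser. Assuming it is reachable on localhost, a simple request like the following should come back with an HTTP 200 status:

(AnnaM) founder@hilbert:~/annam$ curl -sI http://localhost:8080 | head -n 1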


Step 4: Starting Spark Worker Process


The start-slave.sh command is used to start a Spark worker process, pointing it at the master's URL.

(AnnaM) founder@hilbert:~/annam$ start-slave.sh spark://hilbert.asia-east1-b.c.alert-almanac-220207.internal:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-founder-org.apache.spark.deploy.worker.Worker-1-hilbert.out

If the script is not in your $PATH, you can locate it as follows.

(AnnaM) founder@hilbert:~/annam$ locate start-slave.sh

You can also run the script using its absolute path, e.g. /opt/spark/sbin/start-slave.sh.
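
By default the worker offers all CPU cores and the machine's total memory minus 1 GB to the master. If you want to cap this, the standalone worker honors the SPARK_WORKER_CORES and SPARK_WORKER_MEMORY environment variables (normally set in conf/spark-env.sh). For example, the worker could have been started like this; the values 2 and 2g are placeholders for illustration only.

(AnnaM) founder@hilbert:~/annam$ SPARK_WORKER_CORES=2 SPARK_WORKER_MEMORY=2g /opt/spark/sbin/start-slave.sh spark://hilbert.asia-east1-b.c.alert-almanac-220207.internal:7077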

Step 5: Using Spark shell


You can access the Spark shell with the spark-shell command.

(AnnaM) founder@hilbert:~/annam$ /opt/spark/bin/spark-shell
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.4.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/11/01 09:17:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hilbert.asia-east1-b.c.alert-almanac-220207.internal:4040
Spark context available as 'sc' (master = local[*], app id = local-1572599861434).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.4)
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("Hello Spark World")
Hello Spark World
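
Beyond printing a string, you can exercise Spark itself through the pre-created SparkContext (sc). For example, the following expression counts the even numbers in a parallelized range and should return 500:

scala> sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()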

If you are more comfortable with Python, use pyspark instead.

(AnnaM) founder@hilbert:~/annam$ /opt/spark/bin/pyspark
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.4.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/11/01 09:21:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Python version 3.7.4 (default, Aug 13 2019 20:35:49)
SparkSession available as 'spark'.
>>>

You can stop the Spark master and slave processes with the following commands.

$ $SPARK_HOME/sbin/stop-slave.sh
$ $SPARK_HOME/sbin/stop-master.sh

Original source: https://computingforgeeks.com/how-to-install-apache-spark-on-ubuntu-debian/