Spark Getting Started - Develop Locally Using Eclipse

This article will help you jump-start Spark development on your Windows PC or laptop without installing a fully functional Hadoop cluster. I use an 8 GB RAM, 128 GB storage, Windows 10 machine. These days I try to isolate development in separate environments using Docker or Bluemix containers, but I still sometimes fall back to developing on my local machine before deploying the code to a cluster. This blog covers setting up Spark and Eclipse as an IDE for local development with bare-minimum prerequisites.

At the time of writing, Spark 1.5.1 is the latest release, and that is the version I am using.
Follow the instructions below to set up Spark on your machine.

Hadoop Installation on windows

1. Assuming your OS is Windows, download Hadoop for Windows. This will not be a fully functional Hadoop cluster; we only care about some libraries that Spark will need later. Download Hadoop-2.6.0.tar.gz.
2. You don't need to run an installer; all you need to do is uncompress the .tar.gz file (the free 7-Zip utility works well on Windows) to a directory on your machine, preferably c:/hadoop.
3. Set up environment variables - create a HADOOP_HOME variable pointing to the directory where you uncompressed the Hadoop files in the step above.
4. Modify the PATH variable to add %HADOOP_HOME%\bin.

Creating a new env variable


Adding hadoop bin to path variable

  We do not need a working Hadoop cluster on our laptop to use Spark, so the setup described here will not function as a fully working Hadoop cluster.

Checking JAVA configuration on your machine

 1. Make sure Java is available on your machine. Open a command prompt and type java -version and javac -version. If you have version 1.7 or above, we are good with this step; otherwise, install the Java JDK from the Oracle downloads page.



Spark Installation

1. Download Spark from the Apache Spark downloads page. As of this writing the latest build is Spark 1.5.1. Choose release 1.5.1, packaged as pre-built for Hadoop 2.6.



2. Uncompress the spark-1.5.1-bin-hadoop2.6.tgz file to a path on your machine, say c:/spark.
3. Create another environment variable, SPARK_HOME, pointing to the directory you unarchived Spark into.
4. Append %SPARK_HOME%\bin to your PATH environment variable.

With this much set up you should be able to try out Spark using spark-shell. Spark-shell is a REPL (read-evaluate-print loop) for working with Spark interactively. To test it out:
  • Open a command prompt and type spark-shell (spark-shell launches Spark with a Scala prompt). If you are a Python enthusiast, type pyspark to launch the Python shell instead.



A SparkContext is available by default as sc in spark-shell; you can try out sc.textFile to read a file into an RDD, as shown below.
val myData = sc.textFile("File path")
myData.count()
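
As a quick sanity check in the shell, here is a small sketch pointing textFile at the README.md that ships with the Spark download; the path below assumes the file ended up directly under c:/spark, so adjust it to wherever the file actually sits after unpacking:

//read the README bundled with Spark into an RDD (path is an assumption)
val readme = sc.textFile("c:/spark/README.md")
//count the number of lines in the file
readme.count()
//keep only lines that mention Spark and print the first three
readme.filter(line => line.contains("Spark")).take(3).foreach(println)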

Spark-shell is really useful for working interactively and learning the basics, but for a better coding experience we need an IDE. Let us install and use Eclipse for that.

Setting up Eclipse for spark and scala

  • Download and install Eclipse. In this blog I am using Eclipse Mars.
  • Once Eclipse is installed, navigate to Help on the menu bar and go to Eclipse Marketplace. This is a single repository from which you can download and install plugins for Eclipse.
  • In the Marketplace search box, type Scala and install the Scala plugin.

Set up Maven 

Maven is a build tool that packages your code, manages dependencies, and so on. We need Maven 3.3 or greater to work with Spark and Scala.

Just like the installations above, Maven is a binary download; no real installation is needed. We just need to unpack it to a folder and set up a few environment variables.

  • Download Maven from the Apache Maven downloads page - apache-maven-3.3.9-bin.tar.gz.
  • Once Maven is downloaded, unpack it to a folder on the C: drive, say c:/Maven. You can have multiple versions of Maven on your PC, but for our Spark application you will have to use Maven 3.3 or greater.
  • Set up a new environment variable called MAVEN_HOME and set its value to c:/Maven - the directory containing subdirectories such as bin and lib. Add %MAVEN_HOME%\bin to the PATH variable. You can verify the setup by opening a new command prompt and typing mvn -version.









Maven helps us manage dependencies. Most of the time, Scala and Java programs need external JARs (libraries) to work, and Maven will download and manage all of these dependencies for us. We will see how to add dependencies later in this blog when we start writing a simple Spark program.

Install SCALA

Scala is a functional/object-oriented programming language. Spark code can be written in Java, Scala, or Python, so if you plan to stick with Java or Python this step is optional.

  • Download Scala from the Scala downloads page. The version I am using is 2.10.4.
  • Again, Scala is a binary download; unpack the files to a folder such as c:/Scala and set up a new environment variable SCALA_HOME pointing to it.
  • Add %SCALA_HOME%\bin to the PATH variable.
  • Once done, check the Scala version: open a command prompt and type scala -version, as in the quick check below.
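
If the PATH is set correctly, you can also start the Scala REPL by typing scala and evaluate a quick expression. A minimal sketch (versionString comes from the Scala standard library):

// inside the Scala REPL started by typing: scala
println("Running Scala " + scala.util.Properties.versionString)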
 

With this step we are done with the installation; now let us see how to write a simple Spark program in Eclipse.

Start coding Spark in eclipse

  • Open Eclipse. If you are opening it for the first time it may ask for a workspace; set a folder you want to work from and move on.
  • Once Eclipse is open, navigate to File > New > Other.

 

  • Navigate to Maven and click on Maven Project.
  • Choose a location for your project, e.g. C:\LearnSpark.
  • Choose an archetype to start the project from; an archetype is a Maven project template toolkit. Choose the quickstart archetype in our case.
  • Once you hit Next, fill in the Group Id as com.example and the Artifact Id as learnspark, then hit Finish.
  • This will create a new folder structure for you in Project Explorer. You can see M and J icons on the folder, which means it is a Maven Java project. This is because we are using Eclipse Mars for Java, which creates a Java project by default; if you want to work directly in Scala, another option is to use the Scala IDE build of Eclipse. In our case we will change the Java nature to a Scala nature: click on File > Configure > Add Scala Nature.

  • You can see the folder icon changed from Java to Scala. Once done, navigate to the learnspark project root folder > com.example.learnspark, then click New > Scala Object.
  • Create a new file and name it Basics.
  • This will open a Scala file. Type def main and hit Ctrl+Space; this will give you hints for what to type next. Choose main from the hint list.
  • This will give you a main block where you can start coding; Scala programs start executing from main. Add println("I am ready to learn spark") in the main block (the result should look roughly like the sketch below), then right-click and navigate to Run As > Scala Application.
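
At this point the file should look roughly like the following sketch (the exact template Eclipse generates may differ slightly in formatting):

package com.example.learnspark

object Basics {
  //Scala looks for a main method with this signature as the program's entry point
  def main(args: Array[String]): Unit = {
    println("I am ready to learn spark")
  }
}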
 
  • So far we have not written any Spark code; in order to use Spark we need to add its basic dependencies, and this is where Maven comes into play. When we created the Maven project, a pom.xml (Project Object Model) file was auto-created in the project folder. Click on pom.xml and navigate to the pom.xml tab as shown in the image below.
This will open an XML file. In the XML, move to the dependencies section and add the code below:

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.5.1</version>
        </dependency>




  • In the step above we added a dependency on the spark-core library (the _2.10 suffix indicates the Scala version it was built against). As of now our machine does not have this library; Maven will download it for us.
  • Open a command prompt and navigate to the project folder: type cd C:\LearnSpark.
  • Run the dir command; you can see the project folder structure, including the pom.xml file.
  • Now type mvn clean install. This will download all dependent packages and store them on the local machine. The dependency JARs can be found in the local .m2 repository (under your user home directory by default).
  • The above step might take a few minutes. Once the install is successful, come back to Eclipse.
  • If the above step is not done we will see a few errors; to view errors at any time, navigate to the Problems view.
  • You may see a bunch of errors and warnings because the needed dependencies were missing. Once mvn clean install is done, the errors should be gone.

  • If the step above did not fix the issue and we still get errors, the Scala compiler version needs to be changed. To do that, right-click on the project root folder > Properties > Scala Compiler.
  • Choose version 2.10.6, as Spark 1.5.1 is built against Scala 2.10.




 



  • To work on the Spark project we need some sample data. To keep things organized, right-click on the project root folder > New > Folder and name it data (this matches the data/test.dat path used in the code later).
 

  • Right-click on the newly created folder and choose New > File, naming the file test.dat.



  • Add the lines below to the test.dat file:
1,First Row,John,Doe,DataBricks
2,Second Row,John,Smith,DataBricks
3,Third Row,Jane,Doe,DataBricks



  • We are all set to work with Spark. Copy the code below into Basics.scala; the code is commented.

package com.example.learnspark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Basics {
 def main(args: Array[String]): Unit = {
   println("I am ready to learn spark") //try out scala
   //Set up the Spark configuration with an app name and a master.
   //Since we run locally, the master is "local"; the * means use all
   //available cores (it can be changed to 2 or 4 depending on the cores available).
   val conf = new SparkConf().setAppName("First spark App").setMaster("local[*]")
   //Create the SparkContext explicitly. In spark-shell this step is not needed
   //because sc is created automatically and is already available.
   val sc = new SparkContext(conf)
   //Read the text file into an RDD
   val myData = sc.textFile("data/test.dat")
   //Run count on the data and print the result
   println(myData.count())
 }
}


  • Right-click and run the code (Run As > Scala Application); you should see the output printed in the Console view.
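
If you want to go a little further, here is a small sketch (not part of the walkthrough above) of lines you could add inside main after the count, splitting the comma-separated sample rows and filtering them; the column positions assume the test.dat contents shown earlier:

//split each comma-separated line into an array of fields
val rows = myData.map(line => line.split(","))
//keep only the rows whose fourth column (last name) is "Doe"
val does = rows.filter(fields => fields(3) == "Doe")
//bring the matching rows back to the driver and print them
does.collect().foreach(fields => println(fields.mkString(" | ")))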

That's it, all set. Please play around with Spark and let me know if you have any queries or comments. The same blog is published on etlcode.com; for a quicker response, use the comments section over there.


