Getting started with Spark: local development using Eclipse

This article will help you jump-start Spark development on your Windows PC or laptop without a fully functional Hadoop cluster installed. My machine runs Windows 10 with 8 GB of RAM and 128 GB of storage. These days I try to isolate development in separate environments using Docker containers or Bluemix containers. Still, I sometimes fall back to developing on my local machine before deploying the code to a cluster. This blog covers setting up Spark, with Eclipse as the IDE, for local development with the bare minimum of prerequisites.

As of this writing, Spark 1.5.1 is the latest release, and it is the version I am using. Follow the instructions below to set up Spark on your machine.

Hadoop Installation on Windows
  • Assuming your OS is Windows, download Hadoop-2.6.0.tar.gz. We do not need a working Hadoop cluster on our laptop to work with Spark, so this setup will not function as a full Hadoop cluster; we only care about some libraries that Spark will need later.
  • You don't need to install anything. Just unpack the .gz file (the free 7-Zip utility can uncompress it on Windows) to a directory on your machine, preferably c:/hadoop.
  • Set up environment variables: create HADOOP_HOME and point it to the directory where you unpacked the Hadoop files in the step above.
  • Modify the PATH variable to add %HADOOP_HOME%\bin.

Checking Java configuration on your machine
  • Make sure Java is available on your machine. Open a command prompt and type java -version, then javac -version. If both report version 1.7 or above, we are good with this step. Otherwise, install the Java JDK from Oracle downloads.

Spark Installation
  • Download Spark from the Spark downloads page. As of this writing the latest build is Spark 1.5.1. Choose release 1.5.1, packaged as pre-built for Hadoop 2.6.
  • Uncompress the spark-1.5.1-bin-hadoop2.6.tgz file to a path on your machine, say c:/spark.
  • Create another environment variable, SPARK_HOME, pointing to the directory you unarchived.
  • Append the Spark bin directory to your PATH environment variable: PATH=%PATH%;%SPARK_HOME%\bin

With this much set up, you should be able to try out Spark using spark-shell. spark-shell is a REPL (read-evaluate-print loop) for working with Spark interactively. To test it out:

  • Open a command prompt and type spark-shell (spark-shell launches Spark with a Scala prompt). If you are a Python enthusiast, type pyspark to launch the Python shell.

The Spark context is available by default as sc in spark-shell. You can try out sc.textFile to read a file into an RDD and then count its lines:

  val myData = sc.textFile("File path")
  myData.count()

spark-shell is really useful for working interactively and learning the basics, but for a better coding experience we need an IDE. Let us install and use Eclipse for that; another option is IntelliJ IDEA.

Setting up Eclipse for Spark and Scala
  • Download and install Eclipse. In this blog I am using Eclipse Mars.
  • Once Eclipse is installed, navigate to Help on the menu bar and go to Eclipse Marketplace. This is a single repository from which you can download and install plugins for Eclipse.
  • In the Marketplace search box, type Scala and install the Scala plugin.

Set up Maven

Maven is a build tool that packages your code, manages dependencies, and so on. We need Maven 3.3 or greater to work with Spark and Scala. Like the installations above, Maven is a binary download; no real installation is needed. We just need to unpack it to a folder and set up a few environment variables.

  • Download Maven from the Maven downloads page.
  • Once Maven is downloaded, unpack it to a folder on the c: drive, say c:/Maven. You can have multiple versions of Maven on your PC, but for our Spark application you will have to use Maven 3.3 or greater.
  • Set up a new environment variable called MAVEN_HOME and set its value to c:/Maven. This is the directory in which you will see subdirectories like bin, lib, etc. Add %MAVEN_HOME%\bin to the PATH variable.

Maven helps us manage dependencies. Most of the time, Scala and Java programs need external JARs (libraries) to work, and Maven will download and manage all of them for us. We will see how to add dependencies later in this blog, when we start writing a simple Spark program.

Install Scala

Scala is a functional and object-oriented programming language. Spark code can be written in Java, Scala, or Python; if you prefer to write your Spark code in Java or Python, this setup is optional.

  • Download Scala from the Scala downloads page. The version I am using is 2.10.4.
  • Again, Scala is a binary download: unpack the files to a folder, say c:/Scala, and set up a new environment variable SCALA_HOME pointing to it.
  • Add %SCALA_HOME%\bin to the PATH variable.
  • Once done, check the Scala version: open a command prompt and type scala -version.

With this step we are done with the installation. Now let us see how to write a simple Spark program in Eclipse.
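
Before moving on, it can help to confirm that the environment variables from the steps above are visible to new processes. The sketch below is a small stand-alone Scala program of my own, not part of Spark or Hadoop; the object name EnvCheck and the helper missing are hypothetical. It only checks that the four variables are set and non-empty, not that they point to valid installations.

```scala
// Hypothetical helper: reports which variables from the installation steps are not set.
object EnvCheck {
  // Variable names taken from the installation steps in this post.
  val required = Seq("HADOOP_HOME", "SPARK_HOME", "MAVEN_HOME", "SCALA_HOME")

  // Returns the names from `names` that are absent or blank in `env`.
  def missing(names: Seq[String], env: Map[String, String]): Seq[String] =
    names.filter(name => env.get(name).forall(_.trim.isEmpty))

  def main(args: Array[String]): Unit = {
    val notSet = missing(required, sys.env)
    if (notSet.isEmpty) println("All environment variables are set.")
    else println("Missing: " + notSet.mkString(", "))
  }
}
```

Save it as EnvCheck.scala and run it from a freshly opened command prompt with scala EnvCheck.scala, since prompts opened before the variables were created will not see them.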

Start coding Spark using Eclipse
  • Open Eclipse. If you are opening it for the first time it may ask for a workspace; choose the folder you want to work from and move on.
  • Once Eclipse is open, navigate to File > New > Other.

  • Navigate to Maven and click on Maven Project.

  • Choose a location for your project, for example C:\LearnSpark.

  • Choose an archetype for the project. An archetype is a Maven project template. Choose quickstart (maven-archetype-quickstart) in our case.

  • Once you hit Next, fill in the Group ID as com.example and the Artifact ID as learnspark. Hit Finish.

  • This creates a new folder structure in the Project Explorer. The M and J icons on the folder mean it is a Maven Java project. Because we are using Eclipse Mars for Java, it creates a Java project by default; if you want to work directly in Scala, the other option is to use the Scala IDE for Eclipse. In our case we will change the Java nature to the Scala nature: right-click the project and choose Configure > Add Scala Nature.

  • You can see the folder icon change from Java to Scala. Once done, navigate to the learnspark project root folder > com.example.learnspark, right-click, and choose New > Scala Object.

  • Name the new object Basics.

  • This opens a Scala file. Type def main and hit Ctrl+Space; this brings up a list of suggestions for what to type next. Choose main from the list.

  • This gives you a main block where you can start coding; main is the entry point that runs first. Type println("I am ready to learnspark") in the main block, then right-click anywhere on the page and navigate to Run As > Scala Application.
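
At this stage Basics.scala should look roughly like the sketch below. It is only the hello-world step, with no Spark code yet. If Eclipse generated a package line at the top of your file, keep it; I have left it out here so the snippet also runs on its own.

```scala
// Hello-world stage of Basics.scala, before any Spark dependency is added.
// In the Eclipse project the file would normally start with:
//   package com.example.learnspark
object Basics {
  def main(args: Array[String]): Unit = {
    // Plain Scala, just to confirm the Scala nature and the run configuration work
    println("I am ready to learnspark")
  }
}
```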

  • So far we have not written any Spark code. To start using Spark we need to add the Spark dependencies, and this is where Maven comes into play.
    When we created the Maven project, a pom.xml (Project Object Model) file was created automatically in the project folder. Click on pom.xml and open the pom.xml tab of the editor.

  • This opens the XML source. Move to the dependencies section and add a dependency entry for the Spark core library (spark-core).
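
    A dependency entry along the following lines should fit the versions used in this post. The coordinates are my assumption, based on Spark 1.5.1 built for Scala 2.10; adjust the _2.10 suffix and the version if your setup differs. Place it inside the existing dependencies element, next to the entry the quickstart archetype generated.

```xml
<!-- Assumed coordinates: spark-core for Scala 2.10, Spark 1.5.1 -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.1</version>
</dependency>
```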

  • The step above adds a dependency on the spark-core libraries. Our machine does not have this library yet; Maven will download the dependencies for us.
  • Open a command prompt and navigate to the project folder (the one that contains pom.xml), for example cd C:\LearnSpark.
  • Run the dir command; you should see the project folder structure, including the pom.xml file.
  • Now type mvn clean install. This downloads all dependent packages and stores them on the local machine. The dependency JARs can be found in the local Maven repository, the .m2 folder, which by default sits under your user home directory.
  • The step above might take a few minutes. Once the install is successful, come back to Eclipse.
  • If the step above is not done we will get a few errors. To see errors at any time, navigate to the Problems view.

  • You may see a bunch of errors and warnings; this is because the needed dependencies were missing. Once mvn clean install is done, the errors should be gone.

  • If the errors are still there, the Scala version needs to be changed. To do that, right-click on the project root folder > Properties > Scala Compiler.
  • Choose version 2.10.6, since Scala 2.10 is the version supported by the Spark build we downloaded.

  • To work on a Spark project we need some sample data. To keep the data organized, right-click on the project root folder > New > Folder and name it data.

  • Right-click on the newly created folder and choose New > File, and name the file test.dat. You can use a relative or absolute path to access files anywhere on your machine. For files on HDFS, you can use hdfs://server:port/file.

  • Add the lines below to the test.dat file:
1,First Row,John,Doe,DataBricks
2,Second Row,John,Smith,DataBricks
3,Third Row,Jane,Doe,DataBricks
  • We are all set to work on Spark. Copy the code below into Basics.scala. The code is commented for explanation.
package com.example.learnspark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object Basics {
  def main(args: Array[String]): Unit = {
    println("I am ready to learn spark") // try out Scala
    // Set up the Spark conf and master. Since we run locally, use local as the master.
    // * means we use all available cores; it can be changed to 2 or 4 depending
    // on the cores available.
    val conf = new SparkConf().setAppName("First spark App").setMaster("local[*]")
    // Create the SparkContext. Here we have to create it ourselves; in spark-shell
    // sc is set up automatically and is already available.
    val sc = new SparkContext(conf)
    // Read the text file into an RDD
    val myData = sc.textFile("data/test.dat")
    // Run a count on the data and print the result
    println(myData.count())
    // Stop the context once we are done
    sc.stop()
  }
}
  • Right-click and run the code (Run As > Scala Application). You should see the output printed on the Console.
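
As a next step beyond counting lines, each row of test.dat can be split into named fields. The sketch below is plain Scala with no Spark dependency, so it can be tried on its own. The Person case class, its field names, and parseRow are hypothetical names of my own; the post does not say what the columns mean, so treat the field names as assumptions.

```scala
import scala.util.Try

// Hypothetical record type for one line of test.dat; the field names are assumptions.
case class Person(id: Int, label: String, firstName: String, lastName: String, company: String)

object ParseRows {
  // Splits a comma-separated line into a Person.
  // Returns None when the line does not have exactly five fields or the id is not a number.
  def parseRow(line: String): Option[Person] =
    line.split(",", -1).map(_.trim) match {
      case Array(id, label, first, last, company) =>
        Try(id.toInt).toOption.map(n => Person(n, label, first, last, company))
      case _ => None
    }

  def main(args: Array[String]): Unit = {
    // The same three sample rows that were added to test.dat
    val lines = Seq(
      "1,First Row,John,Doe,DataBricks",
      "2,Second Row,John,Smith,DataBricks",
      "3,Third Row,Jane,Doe,DataBricks")
    val people = lines.flatMap(line => parseRow(line).toList)
    // Keep only the rows whose last name is Doe and print them
    people.filter(_.lastName == "Doe").foreach(println)
  }
}
```

Inside the Spark program the same function could be applied to the RDD, for example myData.flatMap(line => ParseRows.parseRow(line).toList), which silently drops malformed lines instead of failing the job.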

That's it, all set. Please play around with Spark and let me know if you have any queries or comments.
