Data scientist with Spark


  • Intro
  • Hadoop and Spark (HDFS / RDD / MapReduce/ Dataframe)
  • Examples: Dataframe/ ML-Lib models on docker
  • ML-Lib pipeline

Scala is a general-purpose programming language. Scala source code is intended to be compiled to Java bytecode to run on a Java Virtual Machine (JVM) so Java libraries may be used directly in Scala. A large reason Scala demand has dramatically risen in recent years is because of Apache Spark. Let’s discuss what Spark is in the context of Big Data.

Data that can fit on a local computer, in a scale of 0–32 GB depending on RAM. what can we do if we have a larger set of data?

  • Try using a SQL database to move storage onto hard drive instead of RAM
  • Or use a distributed system, that distributes the data to multiple machines/computers.

A local process will use the computation resources of a single machine. A distributed process has access to the computational resources across a number of machines connected through a network. After a certain point, it is easier to scale out to many lower CPU machines, than to try to scale up to a single machine with high a CPU.


Hadoop is a way to distribute very large files across multiple machines. It uses the Hadoop Distributed File System (HDFS) which allows a user to work with large data sets. HDFS also duplicates blocks of data for fault tolerance. It also then uses MapReduce that allows computations on that data.

Distributed Storage — HDFS:

HDFS will use blocks of data, with a size of 128MB by default. Each of these blocks is replicated 3 times. The blocks are distributed in a way to support fault tolerance.

Smaller blocks provide more parallelization during processing. Multiple copies of a block prevent loss of data due to a failure of a node.


MapReduce is a way of splitting a computation task into a distributed set of files (such as HDFS). It consists of a Job Tracker and multiple Task Trackers. The Job Tracker sends code to run on the Task Trackers. The Task trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes.

Big Data: Spark

Spark is one of the latest open source projects on Apache technologies being used to quickly and easily handle Big Data. You can think of Spark as a flexible alternative to MapReduce. Spark can use data stored in a variety of formats

○ Cassandra:



○ And more

Difference between Spark and MapReduce:

● MapReduce requires files to be stored in HDFS, Spark does not!

● Spark also can perform operations up to 100x faster than MapReduce

● MapReduce writes most data to disk after each map and reduces operation while Spark keeps most of the data in memory after each transformation. Spark can spill over to disk if the memory is filled

Spark- RDDs:

its mutable collection of the dataset, it stores in multiple boxes. At the core of Spark is the idea of a Resilient Distributed Dataset (RDD). Resilient Distributed Dataset (RDD) has 4 main features:

○ Distributed Collection of Data

○ Fault-tolerant

○ Parallel operation — partitioned

○ Ability to use many data sources

RDDs are immutable, lazily evaluated, and cacheable. There are two types of RDD operations:

  • Transformations
  • Actions

Transformations are basically a recipe to follow. Actions actually perform what the recipe says to do and returns something back.

When one RAM gets corrupted, spark use data from other distributed ram.


Spark DataFrames:

Dataframe work on top of RDDs, Spark DataFrames are also now the standard way of using Spark’s Machine Learning Capabilities. Let’s get a brief tour of the documentation (

Scala collection:


  • cluster + spark(Using Dockerization)
  • Azure Databricks
  • Databricks

Docker-based illustration:

  1. install Docker on your OS
  2. docker pull: Pull an image or a repository from a registry to Install spark on your docker
$ docker pull jupyter/pyspark-notebook$ docker run -it -p 8888:8888 jupyter/pyspark-notebook
pull all configuration required to run Pyspark with jupyter

Multiple API used in spark

Let's solve some example using spark

  1. Compute the value of PI using Monte Carlo Approach

2. Word Count using Spark

Note: 1. output save as folder, not file

2. rename each time outfut folder name

flatMap: Similar to map, it returns a new RDD by applying a function to each element of the RDD, but output is flattened.

3. Spark DataFrame API

Some important blogs contain example and implementation of ML Models

Register the DataFrame as a SQL temporary view

4. Vectors and Correlation Coefficient

MLLib and ML Pipelines:

please find below the document for detail working process.

We are of the terminology used in ML-Pipeline mentioned in official documents

  • DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of
    data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and
    •Transformer: A Transformer is an algorithm that can transform one DataFrame into another DataFrame.
    E.g., an ML model is a Transformer that transforms a DataFrame with features into a DataFrame with
    •Estimator: An Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
    •Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
    •Parameter: All Transformers and Estimators now share a common API for specifying parameters.

Hyper-param tuning:

We use a ParamGridBuilder to construct a grid of parameters to search over. TrainValidationSplit will try all combinations of values and determine the best model using the evaluator.

Sometimes you will not find documents in python because spark ML-Lib still in the development stage. So go through scala or java document, use keywords or functions used in python spark.

Similarly, we can ent other models using the MLLib pipeline.


Leadership belief /Analyst(AI)