Spark Application Evaluation and Performance Boosting

Danqi Huang
Hang Gong

ABSTRACT

Many existing Hadoop applications run much faster within the Spark
framework. RDDs are Spark's low-level API of transformations and
actions and, before Spark 2.0, were its primary programming
interface. Different combinations of transformations and actions on
RDDs can result in significant performance variance for the same
workload.

Despite their many advantages, RDDs do not always deliver
satisfactory performance. Compared with RDDs, DataFrames offer better
space efficiency and shorter running times with the help of the
Catalyst and Tungsten optimizations.

GraphX is a component of Spark for graph-parallel
computation. It started as a research project that aimed to unify
graph-parallel and data-parallel systems and later became a stable
part of Spark because it can boost performance for large-scale
graph computation applications and provides many useful graph
manipulation operators that make coding easier.

In this paper, with implementations utilizing RDDs serving as the
baseline, we aim to show that 1) the built-in optimization engine of
DataFrames can bring a non-negligible performance improvement, and
2) viewing a problem as a graph computation problem allows for
specific optimizations provided by engines like GraphX.