Spark Application Evaluation and Performance Boosting

Danqi Huang, Hang Gong

ABSTRACT

Many existing Hadoop applications run much faster on the Spark framework. Before Spark 2.0, RDDs were Spark's primary programming interface, exposing low-level transformations and actions. Different combinations of transformations and actions on RDDs can result in significant performance variance for the same workload, and despite their many advantages, RDDs sometimes deliver unsatisfying performance. Compared with RDDs, DataFrames offer better space efficiency and shorter running times with the help of the Catalyst and Tungsten optimizations. GraphX is the Spark component for graph-parallel computation. It started as a research project that aimed to unify graph-parallel and data-parallel systems and later became a stable part of Spark, because it can boost the performance of large-scale graph computation applications and provides many useful graph manipulation operators that make coding easier. In this paper, using RDD-based implementations as the baseline, we aim to show that (1) the built-in optimization engine of DataFrames brings a non-negligible performance improvement, and (2) viewing a problem as a graph computation problem enables specific optimizations that engines like GraphX provide.