Summary
In the last few years, Apache Spark has become the de facto standard big data framework in both industry and academic projects. In the scientific domain in particular, it is already used to execute compute- and data-intensive workflows in fields ranging from biology to astronomy. Although Spark is easy to install, it exposes more than one hundred configuration parameters, in addition to application-specific design parameters. To execute Spark-based workflows efficiently, the user therefore has to fine-tune a myriad of Spark and workflow parameters (including, for instance, the partitioning strategy). This configuration task cannot be performed manually in a trial-and-error fashion, since that is tedious and error-prone. This article proposes an approach that generates predictive machine learning models (i.e., decision trees) and then extracts useful rules (i.e., patterns) from these models, which non-expert users can apply to configure Spark and workflow parameters in future executions. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the methodology reduced the number of parameters to be configured by identifying, in the predictive model, the ones most relevant to workflow performance.
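To illustrate the kind of rule extraction the summary describes, the sketch below trains a shallow decision tree on hypothetical past execution records and prints its if/then rules and feature importances. This is only a minimal sketch under assumed conditions: the parameter names, the synthetic data, and the use of scikit-learn are illustrative choices, not the article's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Hypothetical feature matrix: each row is one past workflow execution,
# columns are configuration parameters (names are illustrative only).
feature_names = [
    "spark.executor.memory_gb",
    "spark.executor.cores",
    "spark.sql.shuffle.partitions",
    "workflow_input_partitions",
]
X = np.column_stack([
    rng.integers(2, 32, 200),     # executor memory (GB)
    rng.integers(1, 8, 200),      # executor cores
    rng.integers(50, 400, 200),   # shuffle partitions
    rng.integers(8, 256, 200),    # workflow partitioning choice
])
# Hypothetical target: observed execution time of each run (seconds).
y = rng.uniform(100, 1000, 200)

# Fit a shallow tree so the resulting rules stay readable.
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(X, y)

# Human-readable if/then rules relating parameter values to predicted
# execution time: the "patterns" a non-expert could reuse when
# configuring future runs.
print(export_text(model, feature_names=feature_names))

# Feature importances suggest which parameters matter most, which is
# one way to shrink the set of parameters that must be tuned.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

In this sketch, the extracted rules and importances are only as meaningful as the execution history used for training; the article's methodology applies the same idea to real workflow executions rather than random data.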