Towards Optimizing the Execution of Spark Scientific Workflows Using Machine Learning based Parameter Tuning
Published: 27-02-2019
Abstract:
In the last few years, Apache Spark has become the de facto standard big data framework in both industry and academic projects. In the scientific domain in particular, it is already used to execute compute- and data-intensive workflows ranging from biology to astronomy. Although Spark is an easy-to-install framework, it exposes more than one hundred configuration parameters, in addition to application-specific design parameters. Thus, to execute Spark-based workflows efficiently, the user has to fine-tune a myriad of Spark and workflow parameters (including, for instance, the partitioning strategy). This configuration task cannot reasonably be performed manually by trial and error, since it is tedious and error-prone. This article proposes an approach that generates predictive machine learning models (i.e., decision trees) and then extracts useful rules (i.e., patterns) from these models; non-expert users can apply such rules to configure the workflow and Spark parameters of future executions. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the methodology introduced here reduced the number of parameters to be configured by identifying, in the predictive model, those most relevant to workflow performance.
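As a rough illustration of the idea summarized above (not the authors' actual implementation or dataset), the sketch below trains a decision tree on hypothetical logs of past workflow executions, where each record holds a few Spark parameter values and the observed runtime, and then prints the tree as human-readable rules. All column names, values, and the "fast"/"slow" labeling are assumptions made purely for illustration.

```python
# Illustrative sketch only: fit a decision tree on hypothetical Spark
# execution logs and extract readable configuration rules from it.
# Column names and the fast/slow labeling are assumptions, not the
# paper's actual features or data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical log of past workflow executions: parameter settings
# plus the observed runtime in seconds.
runs = pd.DataFrame({
    "executor_memory_gb": [2, 4, 8, 8, 16, 16],
    "executor_cores":     [1, 2, 2, 4, 4, 8],
    "num_partitions":     [64, 128, 128, 256, 256, 512],
    "runtime_s":          [950, 610, 540, 410, 380, 355],
})

# Label each run as "fast" or "slow" relative to the median runtime,
# so the tree learns which parameter settings tend to perform well.
runs["label"] = (runs["runtime_s"] <= runs["runtime_s"].median()).map(
    {True: "fast", False: "slow"}
)

features = ["executor_memory_gb", "executor_cores", "num_partitions"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(runs[features], runs["label"])

# export_text turns the fitted tree into if/then rules that a non-expert
# user could consult when configuring a future execution.
print(export_text(tree, feature_names=features))
```

The printed if/then rules (e.g., thresholds on executor memory or number of partitions) correspond to the kind of patterns the abstract refers to: they indicate which of the many parameters actually matter for performance and suggest value ranges for configuring future runs.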