First (Virtual) Workshop of the HPDaSc Project

Time: 9:00-12:30 (Rio de Janeiro), 14:00-17:30 (Montpellier)

Workshop objective: focus on current joint work between Brazil and France and discuss progress

Workshop program

09:00-09:30 (BR) / 14:00-14:30 (FR) - Fabio Porto and Patrick Valduriez

Opening: project and workshop overview

09:30-09:50 (BR) / 14:30-14:50 (FR) - Rafael Pereira, Alexis Joly, Fabio Porto

Title: Deep learning techniques on small data


As machine learning techniques become increasingly present in both scientific and industrial fields, many complex tasks are becoming solvable through algorithms that learn from data. In particular, deep learning techniques can learn complex patterns, represented by nonlinear functions, from data structured in different formats, such as images and text. However, many of these methods are data hungry and generalize poorly when optimized on small data. In classification problems, traditional methods are defined only for closed-set tasks, in which the optimization process is well defined for the classes seen in the training set. In this presentation, we discuss methods that constrain the hypothesis space the model traverses during optimization in order to improve generalization on small data. We also introduce approaches for defining classification tasks for the open-set problem, so that a model can be useful for classifying new classes.

09:50-10:10 (BR) / 14:50-15:10 (FR) - Heraldo Borges, Esther Pacitti, Florent Masseglia, Reza Akbarinia, Eduardo Ogasawara

Title: Spatial-Time Motifs Discovery


Discovering motifs in time series data has been widely explored, and various techniques have been developed to tackle this problem. However, for spatial-time series, the literature shows a clear gap. This work addresses that gap by presenting an approach to discover and rank motifs in spatial-time series, called the Combined Series Approach (CSA). CSA is based on partitioning the spatial-time series into blocks. Inside each block, subsequences of the spatial-time series are combined so that a hash-based motif discovery algorithm can be applied. Motifs are validated according to both temporal and spatial constraints. Motifs are then ranked according to their entropy, their number of occurrences, and the proximity of their occurrences. The approach was evaluated using both synthetic and seismic datasets. CSA outperforms traditional methods designed only for time series. CSA was also able to prioritize motifs that were meaningful both in the context of synthetic data and according to seismic specialists.

10:10-10:30 (BR) / 15:10-15:30 (FR) - Gaetan Heidsieck, Daniel de Oliveira, Esther Pacitti, Christophe Pradal, François Tardieu and Patrick Valduriez

Title: Cache-aware scheduling of scientific workflows in multisite cloud


We consider the efficient execution of scientific workflows in a multisite cloud, leveraging the heterogeneous resources available at multiple geo-distributed data centers. Since it is common for workflow users to reuse code or data from previous workflows, a promising approach for efficient workflow execution is to cache intermediate data in order to avoid re-executing entire workflows. However, caching intermediate data and scheduling workflows to exploit such caching in a multisite cloud with heterogeneous sites is complex.

In particular, workflow scheduling must be cache-aware in order to decide whether to reuse cached data or re-execute workflows entirely. In this work, we propose a solution for cache-aware scheduling of scientific workflows in a multisite cloud. Our solution is based on a distributed and parallel architecture and includes new algorithms for adaptive caching, cache site selection, and dynamic workflow scheduling. We implemented our solution in the OpenAlea workflow system, together with cache-aware distributed scheduling algorithms. Our experimental evaluation in a three-site cloud with a real application in plant phenotyping shows that our solution can yield major performance gains, reducing total time by up to 42% with 60% of the same input data for each new execution.

10:30-10:50 (BR) / 15:30-15:50 (FR) - Break

10:50-11:10 (BR) / 15:50-16:10 (FR) - Debora Pina, Liliane Neves, Daniel de Oliveira, Patrick Valduriez, Marta Mattoso

Title: Using provenance for data analyses in physics-informed neural networks

Scientific applications in Computational Science and Engineering (CSE) have been using Deep Neural Networks (DNNs) with an architecture that produces scientific data respecting the conservation laws of physics. These are known as Physics-Informed Neural Networks (PINNs), also called physics-constrained deep learning, among other related technologies.

Tuning hyperparameters is time-consuming and relies on the experience of the DNN specialist. Hyperparameter tuning involves training different configurations and evaluating the results at each trial. These evaluations often require associating different kinds of data, e.g. performance data, environment data, domain data and hyperparameters. Capturing and storing provenance data can help in the data analyses needed for this fine-tuning. We present provenance data services to be invoked by Keras/TensorFlow pipelines. The provenance database is compatible with standard representations, which helps interoperability and reproducibility. These services have been used in PINNs for forward and inverse problems governed by the Eikonal equation. The Eikonal equation often appears in problems including, but not limited to, geometric optics, shortest path problems, image segmentation, and seismic and medical imaging. While there are efficient and stable techniques for solving the Eikonal equation for regular or arbitrary geometries in several dimensions, solving inverse problems governed by this equation remains a big challenge, especially when it comes to uncertainty quantification. The provenance database supports several analyses without having to run the DNN under a specific framework or portal. Queries on loss function values help evaluate epochs for fine-tuning.

11:10-11:30 (BR) / 16:10-16:30 (FR) - Anderson Chaves, Patrick Valduriez, Fabio Porto

Title: Extending SAVIME to Support ML Models


SAVIME is an in-memory multidimensional array database management system. It has been developed to support analytical queries over scientific data. It offers an extremely efficient ingestion procedure, which practically eliminates the waiting time to analyze incoming data. It also supports dense and sparse arrays and non-integer dimension indexing. It offers a functional query language processed by a query optimizer that generates efficient query execution plans. In this talk, in addition to a brief introduction to SAVIME, we will describe its extension to support the execution of machine learning models on data extracted from the database, and its API for interfacing with Python, which transforms its TAR data structure into a NumPy array. We compare the costs of using SAVIME for running predictions with direct invocation from Python scripts. There are plenty of opportunities for future work, including using ML models to improve SAVIME performance, allowing SAVIME subTARs (array partitions) to be distributed across a cluster of machines, and extending the execution engine to process distributed subTARs. Finally, we want to investigate the convergence of linear algebra operators and array operators, enabling model training and execution within SAVIME.

11:30-12:00 (BR) / 16:30-17:00 (FR) - Discussion - Workshop Closing


Zenith: Esther Pacitti, Patrick Valduriez, Alexis Joly, Florent Masseglia, Reza Akbarinia, Christophe Pradal, Baldwin Dumontier, Oleksandra Levchenko, Jean-Christophe Lombardo;

PhD students: Gaetan Heidsieck, Benjamin Deneu, Lamia Djebour, Daniel Rosendo (Kerdata team), Alena Shilova (Cepage team)

LNCC: Fabio Porto, Kary Ocaña, Luiz Gadelha;

PhD students: Anderson Chaves, Gabriel Machado, Maria Luiza Modelli;

MSc student: Rafael Pereira

COPPE/UFRJ: Marta Mattoso, Alvaro Coutinho;

Postdoc: Renan Rocha;

PhD students: Debora Pina, Liliane Kunstmann Neves, Gabriel Barros, Romulo Silva;

UFF: Daniel de Oliveira, Aline Paes, Yuri Frota;

PhD students: Marcello Willians Messina, Carlos Gracioli, Raama Costa, Luiz Gustavo Dias, Maria Luiza Falci;

CEFET-RJ: Eduardo Ogasawara, Rafaelli Coutinho;

PhD students: Heraldo Borges, Rebecca Salles, Lais Baroni;

MSc student: Antonio Castro Jr