Abstract

The cloud has become a good match for managing big data, since it provides virtually unlimited computing, storage, and network resources on demand. By centralizing all data in a large-scale data-center, the cloud significantly simplifies the task of system administration. But for scientific data, where different organizations may have their own data-centers, a distributed (multisite) cloud model, in which each site is visible from outside, is needed. The main objective of this research and scientific collaboration is to develop a multisite cloud architecture for managing and analyzing scientific data, including support for heterogeneous data, distributed scientific workflows, and complex big data analysis. The resulting architecture will enable scalable data management infrastructures that can host a variety of scientific applications that benefit from computing, storage, and networking resources spanning multiple data-centers.

Objectives

The research challenge that we will confront is the design of new techniques for processing scientific data in a distributed and parallel manner, leveraging multiple machines. In particular, the following issues will be investigated:

The approach is to capitalize on the principles of distributed and parallel data management. In particular, this study investigates the impact of intersite data movement on the performance/cost trade-offs of our algorithms.
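
As a rough illustration of this trade-off, the sketch below picks the execution site for a task whose inputs are stored at several sites, minimizing an estimated time made of inter-site transfer plus computation. The site names, speeds, bandwidth values, and the cost formula itself are hypothetical assumptions chosen for illustration; they are not the algorithms developed in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    compute_seconds_per_gb: float  # assumed processing speed at this site


def estimated_cost(site, input_gb_by_site, intersite_gb_per_second):
    """Estimated time in seconds to run a task at `site`.

    Simplified model (an assumption): inputs already stored at `site`
    are read at no transfer cost; all other inputs cross the
    inter-site link at a single fixed bandwidth.
    """
    total_gb = sum(input_gb_by_site.values())
    remote_gb = total_gb - input_gb_by_site.get(site.name, 0.0)
    transfer = remote_gb / intersite_gb_per_second
    compute = total_gb * site.compute_seconds_per_gb
    return transfer + compute


def choose_site(sites, input_gb_by_site, intersite_gb_per_second):
    """Return the site with the lowest estimated cost."""
    return min(
        sites,
        key=lambda s: estimated_cost(s, input_gb_by_site, intersite_gb_per_second),
    )


if __name__ == "__main__":
    sites = [Site("site_a", 2.0), Site("site_b", 1.0)]
    inputs = {"site_a": 90.0, "site_b": 10.0}
    # Slow link: data locality dominates, so the task goes where the data is.
    print(choose_site(sites, inputs, intersite_gb_per_second=0.1).name)
    # Fast link: moving the data to the faster site becomes worthwhile.
    print(choose_site(sites, inputs, intersite_gb_per_second=10.0).name)
```

With the slow link the sketch selects the site that already holds most of the data; with the fast link it selects the faster site, which is the kind of behavior a performance/cost model has to capture.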

Numerical simulation applications aim at producing a realistic computer-based simulation of phenomena. Typically, the application computes the values of variables of interest in space-time. Depending on the simulation precision and the domain to be simulated, the computation may take a very long time and produce a huge amount of data. In this research, we aim at supporting simulation data analytics by providing efficient data management techniques to store, distribute, and query simulation data. We have built SimDB on top of the multidimensional database system SciDB.
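
To make the array-based storage idea more concrete, here is a minimal sketch of how irregularly spaced simulation points could be given dense integer array indexes, by ranking the distinct coordinate values along each dimension. This is an illustrative assumption only: it is neither SimDB's actual mapping nor the SciDB API, and all function names are hypothetical.

```python
def build_axis_index(values):
    """Map each distinct coordinate value to its rank (0, 1, 2, ...)."""
    return {v: rank for rank, v in enumerate(sorted(set(values)))}


def to_dense_cells(points):
    """Convert (t, x, y, z, value) records into integer-indexed cells.

    Returns the four per-dimension lookup tables and a dict keyed by
    (t_idx, x_idx, y_idx, z_idx) holding the simulated value.
    """
    axes = [build_axis_index(p[d] for p in points) for d in range(4)]
    cells = {}
    for p in points:
        key = tuple(axes[d][p[d]] for d in range(4))
        cells[key] = p[4]
    return axes, cells


if __name__ == "__main__":
    # Hypothetical sample: (time, x, y, z, pressure)
    points = [
        (0.0, 0.13, 2.5, 7.01, 101.3),
        (0.0, 4.80, 2.5, 7.01, 99.8),
        (0.5, 0.13, 2.5, 7.01, 100.9),
    ]
    axes, cells = to_dense_cells(points)
    shape = tuple(len(axis) for axis in axes)
    print(shape)
    print(cells)
```

Because indexes are ranks rather than raw coordinates, the resulting array has no gaps caused by uneven spacing between coordinate values, although cells for coordinate combinations that never occur still remain empty.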

Data locality has been successfully exploited by the MapReduce paradigm and its best-known open-source implementation, Apache Hadoop. More recently, the Apache Spark system proposed a richer language for dataflow specification and in-memory data storage. Integrating existing scientific workflows into Spark is, however, difficult because of (1) the specific programming languages adopted and (2) the use of file system I/O. In this work, we will investigate alternatives for integrating existing scientific workflows into the Spark execution model.

We will validate our techniques by building software prototypes that exploit the expertise of the two teams with Spark, SciCumulus, and modern DBMSs (MonetDB and SciDB). We will apply these techniques to real-world scientific data obtained from our application partners in astronomy, bioinformatics, and computational engineering.

Publications

V. P. Freire, J. A. F. de Macedo, F. Porto, R. Akbarinia. NACluster: A Non-Supervised Clustering Algorithm for Matching Multi Catalogues. The 10th IEEE International Conference on e-Science, Oct., Guaruja, S.P., Brazil, 2014.

Liu, Ji; Silva, Vitor; Pacitti, Esther; Valduriez, Patrick; Mattoso, Marta. Scientific Workflow Partitioning in Multisite Cloud. In: 7th International Workshop on Multi/many-Core Computing Systems, 2014, Porto. EuroPar 2014.

Liu, Ji; Silva, Vitor; Pacitti, Esther; Valduriez, Patrick; Mattoso, Marta. Parallelization of Scientific Workflows in the Cloud. In: INRIA Research Report N° RR-8565 (2014).

Dias, J.; G. Guerra, F. Rochinha, A. Coutinho, P. Valduriez, M. Mattoso. Data-Centric Iteration in Dynamic Workflows. Future Generation Computer Systems, Elsevier, Vol. 4, 114-126 (2015).

Liu, Ji; V. Silva, E. Pacitti, P. Valduriez, M. Mattoso. Parallelization of Scientific Workflows in the Cloud. Journal of Grid Computing, DOI 10.1007/s10723-015-9329-8, online March (2015).

Lustosa, H.; F. Porto, R. Costa, P. Blanco, P. Valduriez. Managing Simulation Data with Multidimensional Arrays. Brazilian Symposium on Databases (SBBD) (2015).

Silva, V.; D. de Oliveira, P. Valduriez, M. Mattoso. Analyzing Related Raw Data Files through Dataflows. Concurrency and Computation: Practice and Experience, to appear (2015).

Souza, R.; V. Silva, D. de Oliveira, P. Valduriez, A. Lima, M. Mattoso. Parallel Execution of Workflows Driven by a Distributed Database Management System. Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC15) (2015).

Team

French Team - INRIA

Esther Pacitti, University Montpellier 2, INRIA

Patrick Valduriez, INRIA, LIRMM

Reza Akbarinia, INRIA

Florin Masseglia, INRIA

Miguel Liroz-Gistau, INRIA

Ji Liu (PhD Student), INRIA

Saber Salah (PhD Student), INRIA

Maximilien Servajean (PhD Student), INRIA

Brazilian Team

Fabio Porto, LNCC

Marta Mattoso, COPPE-UFRJ

Alvaro Coutinho, COPPE-UFRJ

Daniel de Oliveira, UFF

Kary Ocaña, COPPE-UFRJ

Eduardo Ogasawara, CEFET-RJ

Flavio Costa (PhD Student), COPPE-UFRJ

Vitor Silva (PhD Student), COPPE-UFRJ

Douglas Ericson de Oliveira (PhD Student), LNCC

Daniel Gaspar (PhD Student), LNCC

Hermano Lustosa (PhD Student), LNCC

Meetings

25th August 2016 - Workshop, LNCC, Petrópolis - RJ

PARTICIPANTS:
Esther Pacitti (LIRMM), Patrick Valduriez (INRIA - Zenith), Marta Mattoso (COPPE - UFRJ), Eduardo Ogasawara (CEFET-RJ), Daniel de Oliveira (UFF), Fabio Porto (LNCC), Kary Ocaña (LNCC), Hermano Lustosa (LNCC), Noel Lemus (LNCC), Heraldo Borges (CEFET-RJ), Yania Souto (LNCC)

PROGRAM:
Fabio Porto - Opening and Project Status
Esther Pacitti (LIRMM) - 15 min
Marta Mattoso (COPPE - UFRJ) - 15 min
Eduardo Ogasawara (CEFET-RJ) - 15 min
Daniel de Oliveira (UFF) - 15 min
New Project Proposal - (11:00 - 11:30)
Patrick Valduriez (INRIA - Zenith) - 30 min

19th August 2015 - UFF, Niterói - RJ

PARTICIPANTS:
Fabio Porto (LNCC), Marta Mattoso (COPPE-UFRJ), Patrick Valduriez (INRIA), Esther Pacitti (LIRMM-INRIA), Eduardo Ogasawara (CEFET-RJ), Daniel de Oliveira (UFF), Kary Ocaña (LNCC)

PRESENTATIONS:
Daniel de Oliveira (UFF): Runtime Performance Monitoring Using Provenance and Domain-Specific Data
Vitor Silva (COPPE-UFRJ): Analyzing Related Raw Data Files through Dataflows
Renan (COPPE-UFRJ): Controlling the Parallel Execution of Workflows Relying on a Distributed Database
Kary Ocaña (LNCC): Design and Execution of Large-Scale Bioinformatics Experiments: Experiences and Open Challenges at LNCC
Eduardo Ogasawara (CEFET-RJ): Identifying Motifs in Spatio-temporal series
Amir Khatibi (LNCC): Unveiling Objects in Big Data
Hermano Lustosa (LNCC): Managing Numerical Simulation Data
Daniel Gaspar (LNCC): Optimizing Scientific Workflows

20th October 2014 - LNCC, Petrópolis - RJ

Title: Workshop - MUSIC - FAPERJ-INRIA
Coordination: Fabio Porto (LNCC), Esther Pacitti (LIRMM - INRIA)
Agenda

  • 8:50 - 9:00 Opening
  • 9:00 - 9:25 Patrick Valduriez (INRIA) - CloudMdsQL
    Talk: "CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language", by Patrick Valduriez (INRIA and LIRMM, Montpellier, France)
    Slides: CloudMdsQL
    The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. The CoherentPaaS project addresses this problem, by providing a common programming language and holistic coherence across different cloud data stores. In this talk, I will present the design of a Cloud Multi-datastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. Thus, CloudMdsQL unifies a quite diverse set of data management technologies while preserving the expressivity of their local query languages. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatabase query language.

  • 9:35 - 10:05 Esther Pacitti (LIRMM - INRIA) - Profile Diversity for Citizen Sciences
    Slides: Profile Diversity for Citizen Sciences
    Profile Diversity for Citizen Sciences
    Esther Pacitti (LIRMM & INRIA, University of Montpellier 2)

    Many scientific fields (e.g. biology, astronomy, physics) produce and consume a considerable amount of diverse data, stored at different heterogeneous sites and produced by different types of user profiles. We investigate two different use cases: a) In the domain of plant phenotyping, there has recently been increasing interest in finding diverse data coming from different research communities. b) In botany, the emergence of citizen sciences has fostered the creation of large and structured communities of nature observers. In this context, there is a need to retrieve diverse plant observations from a diverse spectrum of plant families, genera, and species. In this talk I will present some new issues of profile diversity, a novel idea in searching and recommending scientific items (e.g. documents, images, datasets, etc.), and how profile diversity can be deployed in different kinds of infrastructures (centralized and distributed).

  • 10:15 - 10:35 Eduardo Ogasawara (CEFET-RJ) - Data Mining Algorithms and Applications
    Data Mining Algorithms and Applications
    Eduardo Ogasawara (CEFET-RJ)

    "The MUSIC project addresses many challenges related to basic and applied research concerning data science. More specifically, Data Mining Algorithms and Applications (DMAA) research involves the development of techniques, methodologies, models, algorithms, and architectures for data analysis. Our first track (DMAA#1) is related to our current research status in which Musicians may collaborate. DMAA#1 includes: (i) a framework for educational data warehouses, (ii) identification of denial-of-service attacks, (iii) forecasting of sea surface temperature, and (iv) temporal analysis of enterprise social networks."

  • 10:40 - 11:00 Daniel de Oliveira (UFF) - Optimizing Resource Allocation for Workflow Execution in Multi-site Clouds
  • 11:05 - 11:25 Kary Ocaña (UFRJ-COPPE) - Managing scientific workflows for bioinformatics experiments: challenges and lessons learned
    Provenance in bioinformatics scientific workflows on clouds: challenges and lessons learned
    Kary Ocaña (UFRJ-COPPE)

    With the rapid development in recent years of high-throughput technologies (NGS, next-generation sequencing), biological big data is being generated and stored in biological databases. We address the challenges of designing bioinformatics scientific workflows to support the treatment of this vast quantity of data and to infer evolutionary, phylogenetic, or genomic processes. Scientific workflows are managed by Scientific Workflow Management Systems (SWfMS) and efficiently executed in High Performance Computing (HPC) environments. Despite significant advances in computing capacity and performance, analyzing these large-scale data in search of biomedically relevant patterns remains a challenging task. To address this issue, we introduce the BioSciCumulus platform, which provides a set of bioinformatics scientific workflows developed by our group at COPPE-UFRJ with NACAD-UFRJ and UFF collaborators. It is built on top of the SciCumulus engine and deployed on the Amazon AWS environment. BioSciCumulus was tested with many complex biological data sources and computational tools, as commonly found in bioinformatics. BioSciCumulus workflow applications support advanced resource monitoring, management strategies, and a knowledge management strategy that handles data-domain provenance queries at runtime, in order to guarantee their performance goals and successful completion, and to make data-domain provenance usable for effectively extracting biological information.

  • 11:30 - 11:50 - Hermano Lustosa (LNCC) - SimDB - Managing Numerical Simulation Data with SciDB
    SimDB - Managing Numerical Simulation Data with SciDB
    Hermano Lustosa (LNCC)

    Computational modelling is a science in which scientists conceive mathematical models that strive to reproduce the behavior of a studied phenomenon. By means of simulations, predicted variables are computed, usually over a multidimensional space-time frame. State-of-the-art simulations use a class of software named solvers to solve equations and compute the values of the predicted variables. Moreover, to guide the computation in the physical domain space, polygonal meshes pinpoint the spots where values get computed, and the time dimension brings the system dynamics by indicating successive values for the same mesh spot. Finally, in order to test the model with different parameter sets, the scientist may run thousands of simulations. Despite the huge amount of data produced by such simulations, the process is basically not supported by an efficient data management solution. A typical implementation stores parameters and simulated data in standard files organized in some directory structure, with no support for a high-level query language or distributed query processing.

    In this context, this talk investigates the adoption of the Multidimensional Array Data Model over the SciDB DBMS to manage multidimensional numerical simulation data. We model the 3D spatio-temporal dimensions and the simulation as indexes in the Multidimensional Array Data Model and the predicted variables as values in cells. We present a new strategy to map unstructured 3D spatial meshes into arrays. An orchestrated set of spatial transformations maps the original spatial model into a dense multidimensional array, radically reducing the number of sparse chunks produced by a naive mapping. Our strategy is particularly interesting for large queries that would otherwise retrieve a huge number of sparsely loaded data chunks. We have run a series of experiments over a real case scenario for the simulation of the cardiovascular system, developed at LNCC. We show that, for some queries, query elapsed time improves by a factor of approximately 25 compared to the standard SciDB implementation.

  • 12:00 - 12:30 - Individual working meetings
  • 12:30 - 13:00 Lunch
  • 13:00 - 15:00 - Individual working meetings

8th August 2014 - COPPE-UFRJ, Rio de Janeiro

PRESENTATIONS:
Esther Pacitti
Fabio Porto
Marta Mattoso
Daniel de Oliveira
Kary Ocaña

8th May 2014 - COPPE-UFRJ, Rio de Janeiro

PRESENTATIONS:
Opening
Vitor Silva
Hermano Lustosa
Daniel Oliveira
Douglas Ericson de Oliveira
Esther Pacitti

17th March 2014 - Fabio Porto visits INRIA-Montpellier
