2021

SAVIME: An Array DBMS for Simulation Analysis and ML Models Prediction

Hermano L. S. Lustosa, Anderson C. Silva, Daniel N. R. da Silva, Patrick Valduriez, Fabio Porto

Abstract:

Limitations in current DBMSs prevent their wide adoption in scientific applications. To let such applications benefit from DBMS support, enabling declarative data analysis and visualization over scientific data, we present SAVIME, an in-memory array DBMS. In this work we describe the SAVIME system along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system SciDB during data ingestion. We also show that SAVIME can serve as a storage alternative for a numerical solver without affecting its scalability, making it useful for modern ML-based applications.

2020

Tactful Networking: Humans in the Communication Loop

Rafael Lima Costa, Aline Carneiro Viana, Artur Ziviani, Leobino Nascimento Sampaio

Abstract:

This survey discusses the human-perspective into networking through the Tactful Networking paradigm, whose goal is to add perceptive senses to the network by assigning it with human-like capabilities of observation, interpretation, and reaction to daily-life features and associated entities. To achieve this, knowledge extracted from inherent human behavior in terms of routines, personality, interactions, and others is leveraged, empowering the learning and prediction of user needs to improve QoE and system performance while respecting privacy and fostering new applications and services. Tactful Networking groups solutions from literature and innovative interdisciplinary human aspects studied in other areas. The paradigm is motivated by mobile devices’ pervasiveness and increasing presence as a sensor in our daily social activities. With the human element in the foreground, it is essential: (i) to center big data analytics around individuals; (ii) to create suitable incentive mechanisms for user participation; (iii) to design and evaluate both human-aware and system-aware networking solutions; and (iv) to apply prior and innovative techniques to deal with human-behavior sensing and learning. This survey reviews the human aspect in networking solutions through over a decade, followed by discussing the tactful networking impact through literature in behavior analysis and representative examples. This paper also discusses a framework comprising data management, analytics, and privacy for enhancing human raw-data to assist Tactful Networking solutions. Finally, challenges and opportunities for future research are presented.

2020

STConvS2S: Spatiotemporal Convolutional Sequence to Sequence Network for Weather Forecasting

Rafaela Castro, Yania M. Souto, Eduardo Ogasawara, Fabio Porto, Eduardo Bezerra

Abstract:

Applying machine learning models to meteorological data brings many opportunities to the Geosciences field, such as predicting future weather conditions more accurately. In recent years, modeling meteorological data with deep neural networks has become a relevant area of investigation. These works apply either recurrent neural networks (RNN) or some hybrid approach mixing RNN and convolutional neural networks (CNN). In this work, we propose STConvS2S (Spatiotemporal Convolutional Sequence to Sequence Network), a deep learning architecture built for learning both spatial and temporal data dependencies using only convolutional layers. Our proposed architecture resolves two limitations of convolutional networks to predict sequences using historical data: (1) they violate the temporal order during the learning process and (2) they require the lengths of the input and output sequences to be equal. Computational experiments using air temperature and rainfall data from South America show that our architecture captures spatiotemporal context and that it outperforms or matches the results of state-of-the-art architectures for forecasting tasks. In particular, one of the variants of our proposed architecture is 23% better at predicting future sequences and five times faster at training than the RNN-based model used as a baseline.

2020

A Survey of Biodiversity Informatics: Concepts, Practices and Challenges

Luiz M. R. Gadelha Jr., Pedro C. de Siracusa, Artur Ziviani, Eduardo Couto Dalcin, Helen Michelle Affe, Marinez Ferreira de Siqueira, Luís Alexandre Estevão da Silva, Douglas A. Augusto, Eduardo Krempser, Marcia Chame, Raquel Lopes Costa, Pedro Milet Meirelles, Fabiano Thompson

Abstract:

The unprecedented size of the human population, along with its associated economic activities, has an ever-increasing impact on global environments. Across the world, countries are concerned about the growing resource consumption and the capacity of ecosystems to provide these resources. To effectively conserve biodiversity, it is essential to make indicators and knowledge openly available to decision-makers in ways that they can effectively use them. The development and deployment of mechanisms to produce these indicators depend on having access to trustworthy data from field surveys and automated sensors, biological collections, molecular data, and historic academic literature. The transformation of this raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques used to manage and analyze this data comprise an area often called biodiversity informatics (or e-Biodiversity). Biodiversity data follows a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges.

2020

You Shall Not Pass: Avoiding Spurious Paths in Shortest-Path Based Centralities in Multidimensional Complex Networks

Klaus Wehmuth, Artur Ziviani, Leonardo Chinelate Costa, Ana Paula Couto da Silva, Alex Borges Vieira

Abstract:

In complex network analysis, centralities based on shortest paths, such as betweenness and closeness, are widely used. More recently, many complex systems are being represented by time-varying, multilayer, and time-varying multilayer networks, i.e. multidimensional (or high-order) networks. Nevertheless, it is well known that the aggregation process may create spurious paths on the aggregated view of such multidimensional (high-order) networks. Consequently, these spurious paths may cause shortest-path-based centrality metrics to produce incorrect results, thus undermining the network centrality analysis. In this context, we propose a method able to avoid taking spurious paths into account when computing centralities based on shortest paths in multidimensional (or high-order) networks. Our method is based on MultiAspect Graphs (MAGs) to represent the multidimensional networks, and we show that well-known centrality algorithms can be straightforwardly adapted to the MAG environment. Moreover, we show that, by using this MAG representation, pitfalls usually associated with spurious paths resulting from aggregation in multidimensional networks can be avoided at the time of the aggregation process. As a result, shortest-path-based centralities are assured to be computed correctly for multidimensional networks, without taking into account spurious paths that could otherwise lead to incorrect results. We also present a case study that shows the impact of spurious paths on the computation of shortest paths and, consequently, of shortest-path-based centralities, thus illustrating the importance of this contribution.


2020

An analysis of malaria in the Brazilian Legal Amazon using divergent association rules

Lais Baroni, Rebecca Salles, Samella Salles, Gustavo Guedes, Fabio Porto, Eduardo Bezerra, Christovam Barcellos, Marcel Pedroso, Eduardo Ogasawara

Abstract:

In data analysis, the mining of frequent patterns plays an important role in the discovery of associations and correlations between data. During this process, it is common to produce thousands of association rules (ARs), making the study of each one arduous. This problem weakens the process of finding useful information. There is an ongoing scientific effort to develop approaches that filter interesting patterns, balancing the number of ARs produced against the goal of avoiding rules that are trivial or already known to specialists. However, even when such approaches are adopted, the number of produced ARs can still be high. This work contributes the Divergent Association Rules Approach (DARA), a novel approach for obtaining ARs that diverge from the overall data distribution. DARA is applied right after traditional pattern-filtering approaches. To validate our approach, we studied a dataset on the occurrence of malaria in the Brazilian Legal Amazon. The discovered patterns show that the resulting ARs brought relevant insights from the data. This article contributes to both the medical and computer science fields, since this novel computational approach enabled new findings regarding malaria in Brazil.
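The abstract does not spell out DARA's divergence criterion, but the general idea of contrasting a rule's confidence with its consequent's overall frequency can be sketched in a few lines of plain Python. The `divergence` measure and the toy transactions below are illustrative assumptions, not DARA's actual definition:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def divergence(antecedent, consequent, transactions):
    """Gap between the rule's confidence and the consequent's overall
    frequency -- a lift-like proxy for how much a rule departs from the
    global data distribution (illustrative only)."""
    conf = (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))
    return abs(conf - support(consequent, transactions))

# Toy transactions: each set is one observed region/period (illustrative).
tx = [{'rain', 'malaria'}, {'rain', 'malaria'}, {'rain'}, {'dry'}]
print(divergence({'rain'}, {'malaria'}, tx))  # confidence 2/3 vs base rate 1/2
```

Rules with a large gap would be the "divergent" ones surfaced after the usual interestingness filters.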

2020

CoVeC: Coarse-Grained Vertex Clustering for Efficient Community Detection in Sparse Complex Networks

Gustavo S. Carnivali, Alex B. Vieira, Artur Ziviani, Paulo A. A. Esquef

Abstract:

This paper tackles the problem of community detection in large-scale graphs. In the literature devoted to this topic, an iterative algorithm called the Louvain Method (LM) stands out as an effective and fast solution for this problem. However, the first iterations of the LM are the most costly. To overcome this issue, this paper introduces CoVeC, a Coarse-grained Vertex Clustering for efficient community detection in sparse complex networks. CoVeC pre-processes the original graph in order to forward a graph of reduced size to the LM. The subsequent group formation, including the maximization of group quality as per the modularity metric, is left to the LM. We evaluate our proposal using real-world and synthetic networks of distinct sizes and sparsity levels. Overall, our experimental results show that CoVeC can be much faster than the first iterations of the LM, while remaining similarly effective. In fact, for sparser graphs, the CoVeC+LM combination outperforms the standalone LM and its variations, attaining a mean processing time reduction of 47% with a mean modularity reduction of only 0.4%.

2020

BioinfoPortal: A scientific gateway for integrating bioinformatics applications on the Brazilian national high-performance computing network

Kary A. C. S. Ocaña, Marcelo Galheigo, Carla Osthoff, Luiz M. R. Gadelha Jr., Fabio Porto, Antônio Tadeu A. Gomes, Daniel de Oliveira, Ana Tereza Vasconcelos

Abstract:

Science gateways have gained increasing attention in recent years from diverse communities. Science gateways are software solutions that integrate reusable data and specialized techniques via Web servers while hiding the complexity of the underlying high-performance computing resources. Several projects and initiatives have been started worldwide to develop frameworks that support a broad range of key scientific domains. The biological sciences are undergoing a revolution, since novel technologies such as next-generation sequencing allow data generation at exascale dimensions. Bioinformatics covers a wide range of important applications in health, diversity, and the life sciences, and draws on the high-performance computing culture to accelerate computational simulations of biological systems at all scales. This article introduces the BioinfoPortal gateway, its architecture and functionalities, and its integration with the CSGrid middleware used to manage the high-performance computing environment of the Brazilian National High-Performance Computing System (SINAPAD), including the Santos Dumont supercomputer. We discuss the challenges of integrating BioinfoPortal with the CSGrid framework, covering the general process of installation, configuration, and deployment. Finally, we present the findings of a performance analysis of high-performance computing applications, showing how machine learning was applied to optimize BioinfoPortal by recommending predictive models for efficient resource allocation, achieving over 75% performance efficiency.

2020

SUQ2: Uncertainty Quantification Queries over Large Spatio-temporal Simulations

Noel Moreno Lemus, Fabio Porto, Yania M. Souto, Rafael S. Pereira, Ji Liu, Esther Pacitti, Patrick Valduriez

Abstract:

The combination of high-performance computing, moving towards exascale power, and numerical techniques enables exploring complex physical phenomena using large-scale spatio-temporal modeling and simulation. Improvements in the fidelity of phenomenon simulation require more sophisticated uncertainty quantification analysis, leaving behind measurements restricted to low-order statistical moments and moving towards more expressive probability density function (PDF) models of uncertainty. In this paper, we consider the problem of answering uncertainty quantification queries over large spatio-temporal simulation results. We propose the SUQ2 method, based on the Generalized Lambda Distribution (GLD). GLD fitting is an embarrassingly parallel process that scales linearly with the number of available cores over the number of simulation points. Furthermore, queries are answered entirely from the computed GLDs and the corresponding clusters, which enables trading the huge amount of simulation output data for four values of the GLD parametrization per simulation point. The methodology presented in this paper thus becomes an important ingredient in bringing simulation improvements to exascale computational power.
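Since four GLD parameters replace the raw samples of each simulation point, answering a quantile query reduces to evaluating the fitted function. A minimal sketch using the FMKL parameterization of the GLD (the fitting step itself is not shown, and the parameter values below are illustrative):

```python
def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of the GLD, FMKL parameterization:
    Q(u) = lam1 + ((u**lam3 - 1)/lam3 - ((1-u)**lam4 - 1)/lam4) / lam2,
    for 0 < u < 1."""
    return lam1 + ((u**lam3 - 1) / lam3 - ((1 - u)**lam4 - 1) / lam4) / lam2

# One fitted simulation point stored as just 4 values (illustrative):
point_params = (10.0, 2.0, 0.5, 0.5)
median = gld_quantile(0.5, *point_params)  # symmetric case: median == lam1
```

Any quantile query at a point then needs only these four numbers, which is what lets the method trade the full simulation output for one parameter tuple per point.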


2020

Efficient Network Seeding under Variable Node Cost and Limited Budget for Social Networks

R. C. Souza, D. R. Figueiredo, A. A. A. Rocha, Artur Ziviani

Abstract:

The efficiency of information diffusion on networks highly depends on both the network structure and the set of early spreaders. Moreover, in various realistic scenarios, seeding different nodes implies different costs, as in the case of viral marketing, where costs often correlate with local network structure. The budgeted influence maximization (BIM) problem consists of determining a seed set whose diffusion maximizes the total number of influenced nodes, provided that the seeding cost is within a given budget. We investigate efficient seeding strategies for the BIM problem under the deterministic fixed-threshold diffusion model. In particular, we introduce the concept of surrounding sets: relatively cheap seeds neighboring expensive, structurally privileged nodes, which then become spreaders at lower cost. Numerical experiments with several real networks indicate our method outperforms strategies that seed nodes based on their influence/cost ratios. A key insight from our evaluation is that larger diffusion is generally attained from surrounding sets that consider the two-hop neighborhood of influential nodes, as opposed to their immediate neighbors only.
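The deterministic fixed-threshold model named above is simple enough to sketch directly: a node activates once the number of its active neighbors reaches its threshold. The toy star graph below mirrors the surrounding-set intuition (graph, thresholds, and names are illustrative, not from the paper):

```python
def threshold_diffusion(adj, threshold, seeds):
    """Deterministic fixed-threshold diffusion: iterate until no node
    with enough active neighbours remains inactive."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in adj:
            if v not in active:
                if sum(1 for u in adj[v] if u in active) >= threshold[v]:
                    active.add(v)
                    changed = True
    return active

# An expensive hub 'h' (threshold 2) surrounded by cheap leaves: seeding
# two cheap neighbours activates the hub, which then activates the rest.
adj = {'h': ['a', 'b', 'c', 'd'],
       'a': ['h'], 'b': ['h'], 'c': ['h'], 'd': ['h']}
threshold = {'h': 2, 'a': 1, 'b': 1, 'c': 1, 'd': 1}
print(sorted(threshold_diffusion(adj, threshold, {'a', 'b'})))
```

Seeding only one leaf leaves the hub (and hence everything else) inactive, which is why a surrounding set rather than a single cheap seed is needed.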

2020

Parallel computation of PDFs on big spatial data using Spark

Ji Liu, Noel Moreno Lemus, Esther Pacitti, Fábio Porto, Patrick Valduriez

Abstract:

We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g. using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create some uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. To analyze uncertainty, the main solution is to compute a Probability Density Function (PDF) of each point in the spatial cube area. However, computing PDFs on big spatial data can be very time consuming (from several hours to even months on a computer cluster). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction and sampling. We evaluate our solution by extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (to the order of seconds or minutes) compared with a baseline method.
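Leaving Spark itself aside, the per-point computation being parallelized can be sketched with a plain histogram estimator. Bin layout, sample values, and names below are illustrative assumptions; the paper additionally applies data grouping, ML prediction, and sampling to cut this cost:

```python
def empirical_pdf(samples, n_bins=4, lo=0.0, hi=1.0):
    """Histogram estimate of the PDF of one spatial point's samples."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    return [c / (len(samples) * width) for c in counts]

# One PDF per point of the 3D cube; in Spark this dict comprehension
# becomes a map over the distributed collection of points.
cube = {(0, 0, 0): [0.05, 0.15, 0.25, 0.35],
        (0, 0, 1): [0.80, 0.90, 0.85, 0.95]}
pdfs = {pt: empirical_pdf(vals) for pt, vals in cube.items()}
```

Because each point's PDF is independent of the others, the work partitions cleanly across executors, which is what makes the Spark formulation effective.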



2020

New perspectives on analysing data from biological collections based on social network analytics

P. C. Siracusa, Luiz M. R. Gadelha Jr., Artur Ziviani

Abstract:

Biological collections have been historically regarded as fundamental sources of scientific information on biodiversity. They are commonly associated with a variety of biases, which must be characterized and mitigated before data can be consumed. In this work, we are motivated by taxonomic and collector biases, which can be understood as the effect of particular recording preferences of key collectors on shaping the overall taxonomic composition of biological collections they contribute to. In this context, we propose two network models as the first steps towards a network-based conceptual framework for understanding the formation of biological collections as a result of the composition of collectors’ interests and activities. Building upon the defined network models, we present a case study in which we use our models to explore the community of collectors and the taxonomic composition of the University of Brasília herbarium. We describe topological features of the networks and point out some of the most relevant collectors in the biological collection as well as their taxonomic groups of interest. We also investigate their collaborative behaviour while recording specimens. Finally, we discuss future perspectives for incorporating temporal and geographical dimensions to the models. Moreover, we indicate some possible investigation directions that could benefit from our approach based on social network analytics to model and analyse biological collections.

2019

Towards Optimizing the Execution of Spark Scientific Workflows Using Machine Learning based Parameter Tuning

Douglas de Oliveira, Fábio Porto, Cristina Boeres, Daniel de Oliveira

Summary


In the last few years, Apache Spark has become the de facto standard big data framework in both industry and academic projects. Especially in the scientific domain, it is already used to execute compute- and data-intensive workflows from biology to astronomy. Although Spark is an easy-to-install framework, it has more than one hundred parameters to be set, besides application-specific design parameters. Thus, to execute Spark-based workflows efficiently, the user has to fine-tune a myriad of Spark and workflow parameters (even the partitioning strategy, for instance). This configuration task cannot be performed manually in a trial-and-error way, since that is tedious and error-prone. This article proposes an approach that focuses on generating predictive machine learning models (i.e. decision trees) and then extracting useful rules (i.e. patterns) from these models that non-expert users can apply to configure parameters of future executions of the workflow and of Spark. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the methodology introduced here reduced the number of parameters to be configured by identifying, in the predictive model, the ones most relevant to workflow performance.

2019

Graph-Based Skill Acquisition for Reinforcement Learning

Matheus R. F. Mendonça, Artur Ziviani, André M. S. Barreto

ACM Computing Surveys (CSUR), ISSN: 0360-0300, vol. 52, issue 1, article no. 6

Abstract

In machine learning, Reinforcement Learning (RL) is an important tool for creating intelligent agents that learn solely through experience. One particular subarea within the RL domain that has received great attention is how to define macro-actions, which are temporal abstractions composed of a sequence of primitive actions. This subarea, loosely called skill acquisition, has been under development for several years and has led to better results in a diversity of RL problems. Among the many skill acquisition approaches, graph-based methods have received considerable attention. This survey presents an overview of graph-based skill acquisition methods for RL. We cover a diversity of these approaches and discuss how they evolved throughout the years. Finally, we also discuss the current challenges and open issues in the area of graph-based skill acquisition for RL.

2018

Understanding Human Mobility and Workload Dynamics Due To Different Large-Scale Events Using Mobile Phone Data

Humberto T. Marques-Neto, Faber H. Z. Xavier, Wender Z. Xavier, Carlos Henrique S. Malab, Artur Ziviani, Lucas M. Silveira, Jussara M. Almeida

Journal of Network and Systems Management (JONS), Springer, ISSN: 1064-7570, vol. 26, no. 4, pp. 1079-1100


Abstract:

The analysis of mobile phone data can help carriers to improve the way they deal with unusual workloads imposed by large-scale events. This paper analyzes human mobility and the resulting dynamics in the network workload caused by three different types of large-scale events: a major soccer match, a rock concert, and a New Year's Eve celebration, which took place in a large Brazilian city. Our analysis is based on the characterization of records of mobile phone calls made around the time and place of each event. That is, human mobility and network workload are analyzed in terms of the number of mobile phone calls, their inter-arrival and inter-departure times, and their durations. We use heat maps to visually analyze the spatio-temporal dynamics of the movement patterns of the participants of each large-scale event. The results obtained can be helpful to improve the understanding of human mobility caused by large-scale events. Such results could also provide valuable insights for network managers into effective capacity management and planning strategies. We also present PrediTraf, an application built to help cellphone carriers plan their infrastructure for large-scale events.


2018

BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments

Maria Luiza Mondelli, Thiago Magalhães, Guilherme Loss, Michael Wilde, Ian Foster, Marta Mattoso, Daniel Katz, Helio Barbosa, Ana Tereza R. de Vasconcelos, Kary Ocaña, Luiz M. R. Gadelha Jr.

PeerJ, 6, e5551.


Abstract:

Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.

2017

GeNNet: an integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis.

Raquel L. Costa, Luiz Gadelha, Marcelo Ribeiro-Alves, Fábio Porto

PeerJ, 5, e3509.


Abstract:

There are many steps in analyzing transcriptome data, from the acquisition of raw data to the selection of a subset of representative genes that explain a scientific hypothesis. The data produced can be represented as networks of interactions among genes, and these may additionally be integrated with other biological databases, such as protein-protein interactions, transcription factors, and gene annotation. However, the results of these analyses remain fragmented, imposing difficulties either for later inspection of results or for meta-analysis through the incorporation of new related data. Integrating databases and tools into scientific workflows, orchestrating their execution, and managing the resulting data and its respective metadata are challenging tasks. Additionally, a great amount of effort is equally required to run in silico experiments to structure and compose the information as needed for analysis. Different programs may need to be applied and different files are produced during the experiment cycle. In this context, the availability of a platform supporting experiment execution is paramount. We present GeNNet, an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. It includes GeNNet-Wf, a scientific workflow that pre-loads biological data, pre-processes raw microarray data, and conducts a series of analyses including normalization, differential expression inference, clustering, and gene set enrichment analysis. A user-friendly web interface, GeNNet-Web, allows for setting parameters, executing, and visualizing the results of GeNNet-Wf executions. To demonstrate the features of GeNNet, we performed case studies with data retrieved from GEO, particularly using a single-factor experiment in different analysis scenarios. As a result, we obtained differentially expressed genes whose biological functions were then analyzed.
The results are integrated into GeNNet-DB, a database about genes, clusters, experiments, and their properties and relationships. The resulting graph database is explored with queries that demonstrate the expressiveness of this data model for reasoning about gene interaction networks. GeNNet is the first platform to integrate the analytical process of transcriptome data with graph databases. It provides a comprehensive set of tools that would otherwise be challenging for non-expert users to install and use. Developers can add new functionality to components of GeNNet. The derived data allow for testing previous hypotheses about an experiment and exploring new ones through the interactive graph database environment. It enables the analysis of data on humans, rhesus monkeys, mice, and rats coming from Affymetrix platforms. GeNNet is available as an open source platform at https://github.com/raquele/GeNNet and can be retrieved as a software container with the command docker pull quelopes/gennet.

2016

MobHet: Predicting Human Mobility Using Heterogeneous Data Sources

Lucas M. Silveira, Jussara M. de Almeida, Humberto T. Marques-Neto, Carlos Sarraute, Artur Ziviani

Computer Communications, Special issue on Mobile Traffic Analysis, Elsevier Science, ISSN: 0140-3664, vol. 95, pp. 54-68


Abstract:

The literature is rich in mobility models that aim at predicting human mobility. Yet, these models typically consider only a single kind of data source, such as data from mobile calls or location data obtained from GPS and web applications. Thus, the robustness and effectiveness of such data-driven models from the literature remain unknown when using heterogeneous types of data. In contrast, this paper proposes a novel family of data-driven models, called MobHet, to predict human mobility using heterogeneous data sources. Our proposal is designed to use a combination of features capturing the popularity of a region, the frequency of transitions between regions, and the contacts of a user, which can be extracted from data obtained from various sources, both separately and conjointly. We evaluate the MobHet models, comparing them among themselves and with two single-source data-driven models, namely SMOOTH and Leap Graph, while considering different scenarios with single as well as multiple data sources. Our experimental results show that our best MobHet model produces results that are better than or at least comparable to the best baseline in all considered scenarios, unlike the previous models whose performance is very dependent on the particular type of data used. Our results thus attest to the robustness of our proposed solution to the use of heterogeneous data sources in predicting human mobility.


2016

On MultiAspect Graphs

Klaus Wehmuth, Éric Fleury, Artur Ziviani

Theoretical Computer Science (TCS), Elsevier Science, ISSN: 0304-3975, vol. 651, pp. 50-61


Abstract:

Different graph generalizations have been recently used in an ad-hoc manner to represent multilayer networks, i.e. systems formed by distinct layers where each layer can be seen as a network. Similar constructions have also been used to represent time-varying networks. We introduce the concept of MultiAspect Graph (MAG) as a graph generalization that we prove to be isomorphic to a directed graph, and also capable of representing all previous generalizations. In our proposal, the set of vertices, layers, time instants, or any other independent features are considered as an aspect of the MAG. For instance, a MAG is able to represent multilayer or time-varying networks, while both concepts can also be combined to represent a multilayer time-varying network and even other higher-order networks. Since the MAG structure admits an arbitrary (finite) number of aspects, it hence introduces a powerful modeling abstraction for networked complex systems. This paper formalizes the concept of MAG and derives theoretical results useful in the analysis of complex networked systems modeled using the proposed MAG abstraction. We also present an overview of the MAG applicability.
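The construction behind the isomorphism result can be made concrete: a MAG's composite vertices are tuples drawn from the cartesian product of its aspects, and edges connect such tuples, which yields an ordinary directed graph. The aspects and edge below are illustrative, not from the paper:

```python
from itertools import product

# A MAG with three aspects: vertices, layers, and time instants.
aspects = [['a', 'b'], ['wifi', 'bluetooth'], [0, 1]]
composite_vertices = list(product(*aspects))  # 2 * 2 * 2 = 8 tuples

# A MAG edge joins two composite vertices, so the whole MAG is just a
# directed graph over this vertex set -- the isomorphism stated above.
edges = {(('a', 'wifi', 0), ('b', 'wifi', 1))}
```

This is why standard graph algorithms carry over: they run on the composite-vertex graph, while the aspect structure is kept for interpreting results.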


2016

A note on the complexity of the causal ordering problem

Bernardo Gonçalves, Fabio Porto

Abstract:

In this note we provide a concise report on the complexity of the causal ordering problem, originally introduced by Simon to reason about causal dependencies implicit in systems of mathematical equations. We show that Simon's classical algorithm to infer causal ordering is NP-hard, an intractability previously conjectured but never proven. We then present a detailed account based on Nayak's suggested algorithmic solution (the best available), which is dominated by the computation of a transitive closure, bounded in time by O(|V||S|), where S(E,V) is the input system structure composed of a set E of equations over a set V of variables, and |S| is the number of variable appearances (density). We also comment on the potential of causal ordering for emerging applications in large-scale hypothesis management and analytics.
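Since the dominating step of the algorithmic account above is a transitive closure, that step is worth seeing concretely. A Floyd-Warshall-style sketch over a toy dependency graph (its O(n^3) bound is looser than the O(|V||S|) bound quoted above; the graph is illustrative):

```python
def transitive_closure(n, edges):
    """Boolean reachability matrix for a directed graph on n vertices."""
    reach = [[i == j for j in range(n)] for i in range(n)]
    for u, v in edges:
        reach[u][v] = True
    # Warshall's algorithm: allow intermediate vertices 0..k in turn.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

# Toy causal chain 0 -> 1 -> 2: variable 0 transitively determines 2.
reach = transitive_closure(3, [(0, 1), (1, 2)])
```

In the causal ordering setting, reachability between variables in the dependency structure is what induces the partial order of causal precedence.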


2015

Managing Scientific Hypotheses as Data with Support for Predictive Analytics

Bernardo Gonçalves, Fábio Porto

Computing in Science and Engineering 17(5): 35-43 (2015)

Abstract:

The sheer scale of high-resolution raw data generated by simulation has motivated nonconventional approaches for data exploration, referred to as immersive and in situ query processing. Another step toward supporting scientific progress is to enable data-driven hypothesis management and predictive analytics out of simulation results. The authors of this article present a synthesis method and tool for encoding and managing competing hypotheses as uncertain data in a probabilistic database that can be conditioned in the presence of observations.

2015

BaMBa: towards the integrated management of Brazilian marine environmental data

Pedro Milet Meirelles, Luiz M. R. Gadelha, Jr, Ronaldo Bastos Francini-Filho, Rodrigo Leão de Moura, Gilberto Menezes Amado-Filho, Alex Cardoso Bastos, Rodolfo Pinheiro da Rocha Paranhos, Carlos Eduardo Rezende, Jean Swings, Eduardo Siegle, Nils Edvin Asp Neto, Sigrid Neumann Leitão, Ricardo Coutinho, Marta Mattoso, Paulo S. Salomon, Rogério A.B. Valle, Renato Crespo Pereira, Ricardo Henrique Kruger, Cristiane Thompson, Fabiano L. Thompson

Abstract:

A new open access database, Brazilian Marine Biodiversity (BaMBa) ( https://marinebiodiversity.lncc.br ), was developed in order to maintain large datasets from the Brazilian marine environment. Essentially, any environmental information can be added to BaMBa. Certified datasets obtained from integrated holistic studies, comprising physical–chemical parameters, -omics, microbiology, benthic and fish surveys, can be deposited in the new database, enabling scientific, industrial and governmental policies and actions to be undertaken on marine resources. There are many environmental databases; however, BaMBa is the only integrated database resource that is both supported by a government initiative and exclusive to marine data. BaMBa is linked to the Information System on Brazilian Biodiversity (SiBBr, http://www.sibbr.gov.br/ ) and will offer opportunities for improved governance of marine resources and for the integration of scientists.

2014

Y-DB: Managing scientific hypotheses as uncertain data

Bernardo Gonçalves, Fábio Porto

PVLDB 7(11): 959-962


Abstract:

In view of the paradigm shift that makes science ever more data-driven, we consider deterministic scientific hypotheses as uncertain data. This vision comprises a probabilistic database (p-DB) design methodology for the systematic construction and management of U-relational hypothesis DBs, viz., Υ-DBs. It introduces hypothesis management as a promising new class of applications for p-DBs. We illustrate the potential of Υ-DB as a tool for deep predictive analytics.

2014

Applying Provenance to Protect Attribution in Distributed Computational Scientific Experiments

Luiz M. R. Gadelha Jr., Marta Mattoso

In Provenance and Annotation of Data and Processes (IPAW 2014), Lecture Notes in Computer Science, vol. 8628, pp. 139–151. Springer.


Abstract:

The automation of large scale computational scientific experiments can be accomplished with the use of scientific workflow management systems, which allow for the definition of their activities and data dependencies. The manual analysis of the data resulting from their execution is burdensome, due to the usually large amounts of information. Provenance systems can be used to support this task since they gather details about the design and execution of these experiments. However, provenance information disclosure can also be seen as a threat to correct attribution, if the proper security mechanisms are not in place to protect it. In this article, we address the problem of providing adequate security controls for protecting provenance information taking into account requirements that are specific to e-Science. Kairos, a provenance security architecture, is proposed to protect both prospective and retrospective provenance, in order to reduce the risk of intellectual property disputes in computational scientific experiments.

2012

MTCProv: a practical provenance query framework for many-task scientific computing.

Luiz M. R. Gadelha Jr., Michael Wilde, Marta Mattoso & Ian Foster

Distributed and Parallel Databases, 30(5–6), 351–370.


Abstract:

Scientific research is increasingly assisted by computer-based experiments. Such experiments are often composed of a vast number of loosely-coupled computational tasks that are specified and automated as scientific workflows. This large scale is also characteristic of the data that flows within such “many-task” computations (MTC). Provenance information can record the behavior of such computational experiments via the lineage of process and data artifacts. However, work to date has focused on lineage data models, leaving unsolved issues of recording and querying other aspects, such as domain-specific information about the experiments, MTC behavior given by resource consumption and failure information, or the impact of environment on performance and accuracy. In this work we contribute with MTCProv, a provenance query framework for many-task scientific computing that captures the runtime execution details of MTC workflow tasks on parallel and distributed systems, in addition to standard prospective and data derivation provenance. To help users query provenance data we provide a high level interface that hides relational query complexities. We evaluate MTCProv using an application in protein science, and describe how important query patterns such as correlations between provenance, runtime data, and scientific parameters are simplified and expressed.
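The query pattern mentioned at the end of the abstract, correlating runtime behavior with scientific parameters, can be illustrated with a toy relational example. The schema and table names below are hypothetical stand-ins, not MTCProv's actual interface:

```python
# Illustrative sketch (hypothetical schema): correlate task wall-clock
# time with a scientific parameter via a provenance-style join.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task(task_id INTEGER PRIMARY KEY, app TEXT);
CREATE TABLE runtime(task_id INTEGER, wall_secs REAL);
CREATE TABLE param(task_id INTEGER, name TEXT, value REAL);
INSERT INTO task VALUES (1, 'fold'), (2, 'fold');
INSERT INTO runtime VALUES (1, 12.5), (2, 40.0);
INSERT INTO param VALUES (1, 'temperature', 300), (2, 'temperature', 400);
""")
rows = db.execute("""
    SELECT p.value, r.wall_secs
    FROM task t
    JOIN runtime r ON r.task_id = t.task_id
    JOIN param p   ON p.task_id = t.task_id
    WHERE p.name = 'temperature'
    ORDER BY p.value
""").fetchall()
print(rows)  # [(300.0, 12.5), (400.0, 40.0)]
```

A high-level provenance interface, as the abstract describes, would hide this three-way join behind a simpler query form.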

2008

A Conceptual View on Trajectories

Stefano Spaccapietra, Christine Parent, Maria Luiza Damiani, José Antônio F. Macedo, Fábio Porto, Christelle Vangenot

Data and Knowledge Engineering, 65: 126–146, ISSN 0169-023X

Abstract:

Analysis of trajectory data is the key to a growing number of applications aiming at global understanding and management of complex phenomena that involve moving objects (e.g. worldwide courier distribution, city traffic management, bird migration monitoring). Current DBMS support for such data is limited to the ability to store and query raw movement (i.e. the spatio-temporal position of an object). This paper explores how conceptual modeling could provide applications with direct support of trajectories (i.e. movement data that is structured into countable semantic units) as a first-class concept. A specific concern is to allow enriching trajectories with semantic annotations, letting users attach semantic data to specific parts of the trajectory. Building on a preliminary requirement analysis and an application example, the paper proposes two modeling approaches, one based on a design pattern, the other based on dedicated data types, and illustrates their differences in terms of implementation in an extended-relational context.
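The idea of a trajectory as countable semantic units carrying annotations can be sketched with simple types. These classes are hypothetical illustrations of the concept, not the paper's proposed data types:

```python
# Illustrative sketch: a trajectory as a sequence of annotated episodes
# ("stops" and "moves"), each a countable semantic unit.
from dataclasses import dataclass, field

@dataclass
class Episode:
    kind: str                      # 'stop' or 'move'
    begin: float                   # start timestamp
    end: float                     # end timestamp
    annotations: dict = field(default_factory=dict)  # semantic data

@dataclass
class Trajectory:
    object_id: str
    episodes: list

traj = Trajectory("bird42", [
    Episode("stop", 0.0, 10.0, {"place": "nest"}),
    Episode("move", 10.0, 55.0, {"mode": "flight"}),
    Episode("stop", 55.0, 80.0, {"place": "feeding ground"}),
])
stops = [e for e in traj.episodes if e.kind == "stop"]
print(len(stops))  # 2
```

The paper's "design pattern" approach would express such a structure with ordinary model constructs, while the "dedicated data types" approach would make `Trajectory` a first-class type of the data model.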