Year Publication

Efficient information diffusion in time-varying graphs through deep reinforcement learning

Matheus R. F. Mendonça, André M. S. Barreto, Artur Ziviani


Network seeding for efficient information diffusion over time-varying graphs (TVGs) is a challenging task with many real-world applications. There are several ways to model this spatio-temporal influence maximization problem, but the ultimate goal is to determine the best moment for a node to start the diffusion process. In this context, we propose Spatio-Temporal Influence Maximization (STIM), a model trained with Reinforcement Learning and Graph Embedding over a set of artificial TVGs that is capable of learning the temporal behavior and connectivity pattern of each node, allowing it to predict the best moment to start a diffusion through the TVG. We focus on the scenario where some nodes in the TVG present periodic connectivity patterns, an aspect that received little attention in previous approaches. We also develop a special set of artificial TVGs used for training that simulate a stochastic diffusion process in TVGs, showing that the STIM network can learn an efficient policy even over a non-deterministic environment. After trained, STIM can be used in TVGs of any size, since the number of parameters of the model is independent to the size of the TVG being processed. STIM is also evaluated in two real-world TVGs, where it also manages to efficiently propagate information through the nodes. Finally, we also show that the STIM model has a time complexity of O(|E|). STIM is also highly versatile, where one can change the goal of the model by simply changing the adopted reward function.


Machine Learning Approaches to Extreme Weather Events Forecast in Urban Areas: Challenges and Initial Results

Fabio Porto, Mariza Ferro, Eduardo Ogasawara, Thiago Moeda, Claudio D. T. Barros, Anderson Chaves Silva, Rocio Zorrilla, Rafael S. Pereira, Rafaela Castro, João Victor Silva, Rebecca Salles, Augusto Fonseca, Juliana Hermsdorff, Marcelo Magalhães, Vitor Sá, Antonio Adolfo Simões, Carlos Cardoso, Eduardo Bezerra


Weather forecast services in urban areas face an increasingly hard task of alerting the population on extreme weather events. The hardness of the problem is due to the dynamics of the phenomenon, which challenges numerical weather prediction models and opens an opportunity for Machine Learning (ML) based models that may learn complex mappings between input-output from data. In this paper, we present an ongoing research project which aims at building ML predictive models for extreme precipitation forecast in urban areas, in particular in the Rio de Janeiro City. We present the techniques that we have been developing to improve rainfall prediction and extreme rainfall forecast, along with some initial experimental results. Finally, we discuss some challenges that remain to be tackled in this project.


DJEnsemble: a Cost-Based Selection and Allocation of a Disjoint Ensemble of Spatio-temporal Models

Rafael S. Pereira, Yania Molina Souto, Anderson Silva, Rocio Zorrilla, Brian Tsan, Florin Rusu, Eduardo Ogasawara, Arthur Ziviani, Fabio Porto


Consider a set of black-box models – each of them independently trained on a different dataset – answering the same predictive spatio-temporal query. Being built in isolation, each model traverses its own life-cycle until it is deployed to production, learning data patterns from different datasets and facing independent hyper-parameter tuning. In order to answer the query, the set of black-box predictors has to be ensembled and allocated to the spatio-temporal query region. However, computing an optimal ensemble is a complex task that involves selecting the appropriate models and defining an effective allocation strategy that maps the models to the query region. In this paper we present DJEnsemble, a cost-based strategy for the automatic selection and allocation of a disjoint ensemble of black-box predictors to answer predictive spatio-temporal queries. We conduct a set of extensive experiments that evaluate DJEnsemble and highlight its efficiency, selecting model ensembles that are almost as efficient as the optimal solution. When compared against the traditional ensemble approach, DJEnsemble achieves up to 4X improvement in execution time and almost 9X improvement in prediction accuracy.


You Shall not Pass: Avoiding Spurious Paths in Shortest-Path Based Centralities in Multidimensional Complex Networks

Klaus Wehmuth, Artur Ziviani, Leonardo Chinelate Costa, Ana Paula Couto da Silva, Alex Borges Vieira


In complex network analysis, centralities based on shortest paths, such as betweenness and closeness, are widely used. More recently, many complex systems are being represented by time-varying, multilayer, and time-varying multilayer networks, i.e. multidimensional (or high order) networks. Nevertheless, it is well-known that the aggregation process may create spurious paths on the aggregated view of such multidimensional (high order) networks. Consequently, these spurious paths may then cause shortest-path based centrality metrics to produce incorrect results, thus undermining the network centrality analysis. In this context, we propose a method able to avoid taking into account spurious paths when computing centralities based on shortest paths in multidimensional (or high order) networks. Our method is based on MultiAspect Graphs (MAG) to represent the multidimensional networks and we show that well-known centrality algorithms can be straightforwardly adapted to the MAG environment. Moreover, we show that, by using this MAG representation, pitfalls usually associated with spurious paths resulting from aggregation in multidimensional networks can be avoided at the time of the aggregation process. As a result, shortest-path based centralities are assured to be computed correctly for multidimensional networks, without taking into account spurious paths that could otherwise lead to incorrect results. We also present a case study that shows the impact of spurious paths in the computing of shortest paths and consequently of shortest-path based centralities, thus illustrating the importance of this contribution.


SUQ2 : Uncertainty Quantification Queries over Large Spatio-temporal Simulations

Noel Moreno Lemus, Fabio Porto, Yania M. Souto, Rafael S. Pereira, Ji Liu, Esther Pacciti, and Patrick Valduriez


The combination of high-performance computing towards Exascale power and numerical techniques enables exploring complex physical phenomena using large-scale spatio-temporal modeling and simulation. The improvements on the fidelity of phenomena simulation require more sophisticated uncertainty quantification analysis, leaving behind measurements restricted to low order statistical moments and moving towards more expressive probability density functions models of uncertainty. In this paper, we consider the problem of answering uncertainty quantification queries over large spatio-temporal simulation results. We propose the SUQ2 method based on the Generalized Lambda Distribution (GLD) function. GLD fitting is an embarrassingly parallel process that scales linearly to the number of available cores on the number of simulation points. Furthermore, the answer of queries is entirely based on computed GLDs and the corresponding clusters, which enables trading the huge amount of simulation output data by 4 values in the GLD parametrization per simulation point. The methodology presented in this paper becomes an important ingredient in converging simulations improvements to the Exascale computational power.


Graph-Based Skill Acquisition for Reinforcement Learning

Matheus R. F. Mendonça, Artur Ziviani, André M. S. Barreto

ACM Computing Surveys (CSUR), ISSN: 0360-0300, vol. 52, issue 1, article no. 6


In machine learning, Reinforcement Learning (RL) is an important tool for creating intelligent agents that learn solely through experience. One particular subarea within the RL domain that has received great attention is how to define macro-actions, which are temporal abstractions composed of a sequence of primitive actions. This subarea, loosely called skill acquisition, has been under development for several years and has led to better results in a diversity of RL problems. Among the many skill acquisition approaches, graph-based methods have received considerable attention. This survey presents an overview of graph-based skill acquisition methods for RL. We cover a diversity of these approaches and discuss how they evolved throughout the years. Finally, we also discuss the current challenges and open issues in the area of graph-based skill acquisition for RL.


Understanding Human Mobility and Workload Dynamics Due To Different Large-Scale Events Using Mobile Phone Data

Humberto T. Marques-Neto, Faber H. Z. Xavier, Wender Z. Xavier, Carlos Henrique S. Malab, Artur Ziviani, Lucas M. Silveira & Jussara M. Almeida

Journal of Network and Systems Management (JONS), Springer, ISSN: 1064-7570, vol. 26, no. 4, pp. 1079-1100,


The analysis of mobile phone data can help carriers to improve the way they deal with unusual workloads imposed by large-scale events. This paper analyzes human mobility and the resulting dynamics in the network workload caused by three different types of large-scale events: a major soccer match, a rock concert, and a New Year’s Eve celebration, which took place in a large Brazilian city. Our analysis is based on the characterization of records of mobile phone calls made around the time and place of each event. That is, human mobility and network workload are analyzed in terms of the number of mobile phone calls, their inter-arrival and inter-departure times, and their durations. We use heat maps to visually analyze the spatio-temporal dynamics of the movement patterns of the participants of the large-scale event. The results obtained can be helpful to improve the understanding of human mobility caused by large-scale events. Such results could also provide valuable insights for network managers into effective capacity management and planning strategies. We also present PrediTraf, an application built to help the cellphone carriers plan their infrastructure on large-scale events.


Point pattern search in big data

Fabio Porto, João N. Rittmeyer, Eduardo Ogasawara, Alberto Krone-Martins, Patrick Valduriez, Dennis Shasha

SSDBM 201821:1-21:12


Consider a set of points P in space with at least some of the pairwise distances specified. Given this set P, consider the following three kinds of queries against a database D of points : (i) pure constellation query: find all sets S in D of size |P| that exactly match the pairwise distances within P up to an additive error ϵ; (ii) isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f for which the distances between pairs in S exactly match f times the distances between corresponding pairs of P up to an additive ϵ; (iii) non-isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f and for at least some pairs of points, a maximum stretch factor mi,j > 1 such that (f X mi,jXdist(pi, pj))+ϵ > dist(si,sj) > (f X dist(pi, pj)) - ϵ. Finding matches to such queries has applications to spatial data in astronomical, seismic, and any domain in which (approximate, scale-independent) geometrical matching is required. Answering the isotropic and non-isotropic queries is challenging because scale factors and stretch factors may take any of an infinite number of values. This paper proposes practically efficient sequential and distributed algorithms for pure, isotropic, and non-isotropic constellation queries. As far as we know, this is the first work to address isotropic and non-isotropic queries.


BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments

Maria Luiza Mondelli, Thiago Magalhães, Guilherme Loss, Michael Wilde, Ian Foster, Marta Mattoso, Daniel Katz, Helio Barbosa, Ana Tereza R. de Vasconcelos, Kary Ocaña, Luiz M.R. Gadelha Jr​

PeerJ, 6, e5551.


Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, by using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high-performance, reducing up to 98% of the case studies execution time. We also show how the application of machine learning techniques can enrich the analysis process.


GeNNet: an integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis.

Raquel L. Costa​, Luiz Gadelha, Marcelo Ribeiro-Alves, Fábio Porto

PeerJ, 5, e3509.


There are many steps in analyzing transcriptome data, from the acquisition of raw data to the selection of a subset of representative genes that explain a scientific hypothesis. The data produced can be represented as networks of interactions among genes and these may additionally be integrated with other biological databases, such as Protein-Protein Interactions, transcription factors and gene annotation. However, the results of these analyses remain fragmented, imposing difficulties, either for posterior inspection of results, or for meta-analysis by the incorporation of new related data. Integrating databases and tools into scientific workflows, orchestrating their execution, and managing the resulting data and its respective metadata are challenging tasks. Additionally, a great amount of effort is equally required to run in-silico experiments to structure and compose the information as needed for analysis. Different programs may need to be applied and different files are produced during the experiment cycle. In this context, the availability of a platform supporting experiment execution is paramount. We present GeNNet, an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. It includes GeNNet-Wf, a scientific workflow that pre-loads biological data, pre-processes raw microarray data and conducts a series of analyses including normalization, differential expression inference, clusterization and gene set enrichment analysis. A user-friendly web interface, GeNNet-Web, allows for setting parameters, executing, and visualizing the results of GeNNet-Wf executions. To demonstrate the features of GeNNet, we performed case studies with data retrieved from GEO, particularly using a single-factor experiment in different analysis scenarios. As a result, we obtained differentially expressed genes for which biological functions were analyzed. The results are integrated into GeNNet-DB, a database about genes, clusters, experiments and their properties and relationships. The resulting graph database is explored with queries that demonstrate the expressiveness of this data model for reasoning about gene interaction networks. GeNNet is the first platform to integrate the analytical process of transcriptome data with graph databases. It provides a comprehensive set of tools that would otherwise be challenging for non-expert users to install and use. Developers can add new functionality to components of GeNNet. The derived data allows for testing previous hypotheses about an experiment and exploring new ones through the interactive graph database environment. It enables the analysis of different data on humans, rhesus, mice and rat coming from Affymetrix platforms. GeNNet is available as an open source platform at and can be retrieved as a software container with the command docker pull quelopes/gennet.


MobHet: Predicting Human Mobility Using Heterogeneous Data Sources

Lucas M. Silveira, Jussara M. de Almeida, Humberto T. Marques-Neto, Carlos Sarraute, Artur Ziviani

Computer Communications, Special issue on Mobile Traffic Analysis, Elsevier Science, ISSN: 0140-3664, vol. 95, pp. 54-68


The literature is rich in mobility models that aim at predicting human mobility. Yet, these models typically consider only a single kind of data source, such as data from mobile calls or location data obtained from GPS and web applications. Thus, the robustness and effectiveness of such data-driven models from the literature remain unknown when using heterogeneous types of data. In contrast, this paper proposes a novel family of data-driven models, called MobHet, to predict human mobility using heterogeneous data sources. Our proposal is designed to use a combination of features capturing the popularity of a region, the frequency of transitions between regions, and the contacts of a user, which can be extracted from data obtained from various sources, both separately and conjointly. We evaluate the MobHet models, comparing them among themselves and with two single-source data-driven models, namely SMOOTH and Leap Graph, while considering different scenarios with single as well as multiple data sources. Our experimental results show that our best MobHet model produces results that are better than or at least comparable to the best baseline in all considered scenarios, unlike the previous models whose performance is very dependent on the particular type of data used. Our results thus attest the robustness of our proposed solution to the use of heterogeneous data sources in predicting human mobility.


On MultiAspect Graphs

Klaus Wehmuth, Éric Fleury, Artur Ziviani

Theoretical Computer Science (TCS), Elsevier Science, ISSN: 0304-3975, vol. 651, pp. 50-61


Different graph generalizations have been recently used in an ad-hoc manner to represent multilayer networks, i.e. systems formed by distinct layers where each layer can be seen as a network. Similar constructions have also been used to represent time-varying networks. We introduce the concept of MultiAspect Graph (MAG) as a graph generalization that we prove to be isomorphic to a directed graph, and also capable of representing all previous generalizations. In our proposal, the set of vertices, layers, time instants, or any other independent features are considered as an aspect of the MAG. For instance, a MAG is able to represent multilayer or time-varying networks, while both concepts can also be combined to represent a multilayer time-varying network and even other higher-order networks. Since the MAG structure admits an arbitrary (finite) number of aspects, it hence introduces a powerful modeling abstraction for networked complex systems. This paper formalizes the concept of MAG and derives theoretical results useful in the analysis of complex networked systems modeled using the proposed MAG abstraction. We also present an overview of the MAG applicability.


Database System Support of Simulation Data

Hermano Lustosa, Fabio Porto, Pablo Blanco, Patrick Valduriez

2016): 1329-1340 (PVLDB 9(13))


Supported by increasingly efficient HPC infra-structure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this paper, we investigate techniques for managing such data using an array DBMS. We take advantage of multidimensional arrays that nicely models the dimensions and variables used in numerical simulations. However, a naive approach to map simulation data files may lead to sparse arrays, impacting query response time, in particular, when the simulation uses irregular meshes to model its physical domain. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks with the use of equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments over real-world data, compared them with two other approaches: row-store and column-store DBMS. The results indicate that multidimensional arrays and column-stores are much faster than a traditional row-store system for queries over a larger amount of simulation data. They also help identifying the scenarios where array DBMSs are most efficient, and those where they are outperformed by column-stores.


A Unifying Model for Representing Time-Varying Graphs

Klaus Wehmuth, Artur Ziviani, Eric Fleury

IEEE International Conference on Data Science and Advanced Analytics - IEEE DSAA 2015, Paris, France


Graph-based models form a fundamental aspect of data representation in Data Sciences and play a key role in modeling complex networked systems. In particular, recently there is an ever-increasing interest in modeling dynamic complex networks, i.e. networks in which the topological structure (nodes and edges) may vary over time. In this context, we propose a novel model for representing finite discrete Time-Varying Graphs (TVGs), which are typically used to model dynamic complex networked systems. We analyze the data structures built from our proposed model and demonstrate that, for most practical cases, the asymptotic memory complexity of our model is in the order of the cardinality of the set of edges. Further, we show that our proposal is an unifying model that can represent several previous (classes of) models for dynamic networks found in the recent literature, which in general are unable to represent each other. In contrast to previous models, our proposal is also able to intrinsically model cyclic (i.e. periodic) behavior in dynamic networks. These representation capabilities attest the expressive power of our proposed unifying model for TVGs. We thus believe our unifying model for TVGs is a step forward in the theoretical foundations for data analysis of complex networked systems.


Managing Scientific Hypotheses as Data with Support for Predictive Analytics

Bernardo Gonçalves, Fábio Porto

 Computing in Science and Engineering 17(5): 35-43 (2015)


The sheer scale of high-resolution raw data generated by simulation has motivated nonconventional approaches for data exploration, referred to as immersive and in situ query processing. Another step toward supporting scientific progress is to enable data-driven hypothesis management and predictive analytics out of simulation results. The authors of this article present a synthesis method and tool for encoding and managing competing hypotheses as uncertain data in a probabilistic database that can be conditioned in the presence of observations.


BaMBa: towards the integrated management of Brazilian marine environmental data

Pedro Milet Meirelles, Luiz M. R. Gadelha, Jr, Ronaldo Bastos Francini-Filho, Rodrigo Leão de Moura, Gilberto Menezes Amado-Filho, Alex Cardoso Bastos, Rodolfo Pinheiro da Rocha Paranhos, Carlos Eduardo Rezende, Jean Swings, Eduardo Siegle, Nils Edvin Asp Neto, Sigrid Neumann Leitão, Ricardo Coutinho, Marta Mattoso, Paulo S. Salomon, Rogério A.B. Valle, Renato Crespo Pereira, Ricardo Henrique Kruger, Cristiane Thompson, Fabiano L. Thompson


A new open access database, Brazilian Marine Biodiversity (BaMBa) ( ), was developed in order to maintain large datasets from the Brazilian marine environment. Essentially, any environmental information can be added to BaMBa. Certified datasets obtained from integrated holistic studies, comprising physical–chemical parameters, -omics, microbiology, benthic and fish surveys can be deposited in the new database, enabling scientific, industrial and governmental policies and actions to be undertaken on marine resources. There is a significant number of databases, however BaMBa is the only integrated database resource both supported by a government initiative and exclusive for marine data. BaMBa is linked to the Information System on Brazilian Biodiversity (SiBBr, ) and will offer opportunities for improved governance of marine resources and scientists’ integration.


A Conceptual View on Trajectories

Stefano Spaccapietra, Christine Parent, Maria Luiza Damiani, José Antônio F. Macedo, Fábio Porto, Christelle Vangenot

Journal of Data and Knowledge Engineering, pp.126-146, ISSN:0169-023X, V(65)


Stefano Analysis of trajectory data is the key to a growing number of applications aiming at global understanding and management of complex phenomena that involve moving objects (e.g. worldwide courier distribution, city traffic management, bird migration monitoring). Current DBMS support for such data is limited to the ability to store and query raw movement (i.e. the spatio-temporal position of an object). This paper explores how conceptual modeling could provide applications with direct support of trajectories (i.e. movement data that is structured into countable semantic units) as a first class concept. A specific concern is to allow enriching trajectories with semantic annotations allowing users to attach semantic data to specific parts of the trajectory. Building on a preliminary requirement analysis and an application example, the paper proposes two modeling approaches, one based on a design pattern, the other based on dedicated data types, and illustrates their differences in terms of implementation in an extended-relational context.