Efficient information diffusion in time-varying graphs through deep reinforcement learning
Machine Learning Approaches to Extreme Weather Events Forecast in Urban Areas: Challenges and Initial Results
Weather forecast services in urban areas face an increasingly hard task of alerting the population to extreme weather events. The hardness of the problem is due to the dynamics of the phenomenon, which challenges numerical weather prediction models and opens an opportunity for Machine Learning (ML) based models that may learn complex input-output mappings from data. In this paper, we present an ongoing research project which aims at building ML predictive models for extreme precipitation forecast in urban areas, in particular in the city of Rio de Janeiro. We present the techniques that we have been developing to improve rainfall prediction and extreme rainfall forecast, along with some initial experimental results. Finally, we discuss some challenges that remain to be tackled in this project.
DJEnsemble: a Cost-Based Selection and Allocation of a Disjoint Ensemble of Spatio-temporal Models
Consider a set of black-box models – each of them independently trained on a different dataset – answering the same predictive spatio-temporal query. Being built in isolation, each model traverses its own life-cycle until it is deployed to production, learning data patterns from different datasets and facing independent hyper-parameter tuning. In order to answer the query, the set of black-box predictors has to be ensembled and allocated to the spatio-temporal query region. However, computing an optimal ensemble is a complex task that involves selecting the appropriate models and defining an effective allocation strategy that maps the models to the query region. In this paper we present DJEnsemble, a cost-based strategy for the automatic selection and allocation of a disjoint ensemble of black-box predictors to answer predictive spatio-temporal queries. We conduct a set of extensive experiments that evaluate DJEnsemble and highlight its efficiency, selecting model ensembles that are almost as efficient as the optimal solution. When compared against the traditional ensemble approach, DJEnsemble achieves up to 4X improvement in execution time and almost 9X improvement in prediction accuracy.
You Shall not Pass: Avoiding Spurious Paths in Shortest-Path Based Centralities in Multidimensional Complex Networks
In complex network analysis, centralities based on shortest paths, such as betweenness and closeness, are widely used. More recently, many complex systems are being represented by time-varying, multilayer, and time-varying multilayer networks, i.e. multidimensional (or high order) networks. Nevertheless, it is well-known that the aggregation process may create spurious paths on the aggregated view of such multidimensional (high order) networks. Consequently, these spurious paths may then cause shortest-path based centrality metrics to produce incorrect results, thus undermining the network centrality analysis. In this context, we propose a method able to avoid taking into account spurious paths when computing centralities based on shortest paths in multidimensional (or high order) networks. Our method is based on MultiAspect Graphs (MAG) to represent the multidimensional networks and we show that well-known centrality algorithms can be straightforwardly adapted to the MAG environment. Moreover, we show that, by using this MAG representation, pitfalls usually associated with spurious paths resulting from aggregation in multidimensional networks can be avoided at the time of the aggregation process. As a result, shortest-path based centralities are assured to be computed correctly for multidimensional networks, without taking into account spurious paths that could otherwise lead to incorrect results. We also present a case study that shows the impact of spurious paths in the computing of shortest paths and consequently of shortest-path based centralities, thus illustrating the importance of this contribution.
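To make the notion of a spurious path concrete, the following is a minimal Python sketch. It is not the MAG-based method of the paper; it only assumes a toy time-varying network given as hypothetical `(u, v, t)` contacts and compares reachability in the time-aggregated view with reachability along time-respecting paths. The function names and the data are illustrative assumptions.

```python
from collections import defaultdict, deque

# Hypothetical contacts (u, v, t): the edge u -> v is available only at time t.
contacts = [("B", "C", 1), ("A", "B", 2)]

def aggregated_reachable(contacts, src):
    """Nodes reachable from src when time stamps are ignored (aggregated view)."""
    adj = defaultdict(set)
    for u, v, _ in contacts:
        adj[u].add(v)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen - {src}

def time_respecting_reachable(contacts, src):
    """Nodes reachable from src using contacts in strictly increasing time order."""
    earliest = {src: float("-inf")}  # earliest known arrival time per node
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        # A contact is usable only if u was reached strictly before time t.
        if u in earliest and earliest[u] < t and t < earliest.get(v, float("inf")):
            earliest[v] = t
    return set(earliest) - {src}

print(aggregated_reachable(contacts, "A"))
print(time_respecting_reachable(contacts, "A"))
```

In this toy case the aggregated view contains the path A -> B -> C, but the contact B -> C happens before A -> B, so C is not reachable from A in time order. The path through B is spurious, and any shortest-path based centrality computed on the aggregated view would count it.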
SUQ2: Uncertainty Quantification Queries over Large Spatio-temporal Simulations
The combination of high-performance computing towards Exascale power and numerical techniques enables exploring complex physical phenomena using large-scale spatio-temporal modeling and simulation. The improvements in the fidelity of phenomena simulation require more sophisticated uncertainty quantification analysis, leaving behind measurements restricted to low-order statistical moments and moving towards more expressive probability density function models of uncertainty. In this paper, we consider the problem of answering uncertainty quantification queries over large spatio-temporal simulation results. We propose the SUQ2 method based on the Generalized Lambda Distribution (GLD) function. GLD fitting is an embarrassingly parallel process that scales linearly with the number of available cores over the number of simulation points. Furthermore, query answers are based entirely on the computed GLDs and the corresponding clusters, which enables trading the huge amount of simulation output data for the 4 values of the GLD parametrization per simulation point. The methodology presented in this paper becomes an important ingredient in converging simulation improvements to Exascale computational power.
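As an illustration of how four values can stand in for a whole distribution at a simulation point, here is a small Python sketch of a GLD quantile function. It assumes the Ramberg-Schmeiser form Q(u) = λ1 + (u^λ3 - (1 - u)^λ4) / λ2; the abstract does not say which parameterization SUQ2 uses, so this choice, the function names, and the parameter values are assumptions for illustration only.

```python
import random

def gld_quantile(u, lam1, lam2, lam3, lam4):
    """GLD quantile function in the Ramberg-Schmeiser form (assumed here).

    lam1 is a location parameter, lam2 a scale parameter, and lam3/lam4
    shape the left and right tails.
    """
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2

def gld_sample(n, params, seed=0):
    """Draw n values by inverse-transform sampling from the four GLD parameters."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        u = rng.random()
        if 0.0 < u < 1.0:  # skip the (rare) endpoint u == 0.0
            out.append(gld_quantile(u, *params))
    return out

# Hypothetical parameters for one simulation point.
params = (3.0, 2.0, 0.1, 0.1)
print(gld_quantile(0.5, *params))
print(gld_sample(3, params))
```

Because the quantile function is available in closed form, questions such as "what is the median at this point?" can be answered from the four stored parameters alone, without revisiting the raw simulation output.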
Graph-Based Skill Acquisition for Reinforcement Learning
ACM Computing Surveys (CSUR), ISSN: 0360-0300, vol. 52, issue 1, article no. 6
In machine learning, Reinforcement Learning (RL) is an important tool
for creating intelligent agents that learn solely through experience.
One particular subarea within the RL domain that has received great
attention is how to define macro-actions, which are temporal
abstractions composed of a sequence of primitive actions. This subarea,
loosely called skill acquisition, has been under development for several
years and has led to better results in a diversity of RL problems.
Among the many skill acquisition approaches, graph-based methods have
received considerable attention. This survey presents an overview of
graph-based skill acquisition methods for RL. We cover a diversity of
these approaches and discuss how they evolved throughout the years.
Finally, we also discuss the current challenges and open issues in the
area of graph-based skill acquisition for RL.
Understanding Human Mobility and Workload Dynamics Due To Different Large-Scale Events Using Mobile Phone Data
Journal of Network and Systems Management (JONS), Springer, ISSN: 1064-7570, vol. 26, no. 4, pp. 1079-1100
The analysis of mobile phone data can help carriers improve the way they deal with unusual workloads imposed by large-scale events. This paper analyzes human mobility and the resulting dynamics in the network workload caused by three different types of large-scale events: a major soccer match, a rock concert, and a New Year’s Eve celebration, which took place in a large Brazilian city. Our analysis is based on the characterization of records of mobile phone calls made around the time and place of each event. That is, human mobility and network workload are analyzed in terms of the number of mobile phone calls, their inter-arrival and inter-departure times, and their durations. We use heat maps to visually analyze the spatio-temporal dynamics of the movement patterns of the participants of each large-scale event. The results obtained can be helpful to improve the understanding of human mobility caused by large-scale events. Such results could also provide valuable insights for network managers into effective capacity management and planning strategies. We also present PrediTraf, an application built to help cellphone carriers plan their infrastructure for large-scale events.
Point pattern search in big data
SSDBM 2018: 21:1-21:12
Consider a set of points P in space with at least some of the pairwise distances specified. Given this set P, consider the following three kinds of queries against a database D of points: (i) pure constellation queries: find all sets S in D of size |P| that match the pairwise distances within P up to an additive error ϵ; (ii) isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f for which the distances between pairs in S match f times the distances between corresponding pairs of P up to an additive error ϵ; (iii) non-isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f and, for at least some pairs of points, a maximum stretch factor $m_{i,j} > 1$ such that $f \cdot m_{i,j} \cdot \mathrm{dist}(p_i, p_j) + \epsilon > \mathrm{dist}(s_i, s_j) > f \cdot \mathrm{dist}(p_i, p_j) - \epsilon$. Finding matches to such queries has applications to spatial data in astronomy, seismology, and any other domain in which (approximate, scale-independent) geometrical matching is required. Answering the isotropic and non-isotropic queries is challenging because scale factors and stretch factors may take any of an infinite number of values. This paper proposes practically efficient sequential and distributed algorithms for pure, isotropic, and non-isotropic constellation queries. As far as we know, this is the first work to address isotropic and non-isotropic queries.
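The pure and isotropic match conditions can be sketched as a brute-force check on one candidate. This is only an illustration of the definitions, not the sequential or distributed algorithms proposed in the paper. It assumes that all pairwise distances of P are specified and that the candidate S is given as an ordered tuple whose i-th point corresponds to the i-th point of P; the function names are hypothetical.

```python
from itertools import combinations
from math import dist

def pure_match(pattern, candidate, eps):
    """True if every pairwise distance in candidate is within eps of the pattern's."""
    return all(
        abs(dist(candidate[i], candidate[j]) - dist(pattern[i], pattern[j])) <= eps
        for i, j in combinations(range(len(pattern)), 2)
    )

def isotropic_match(pattern, candidate, eps):
    """True if one scale factor f > 0 gives |d_S - f * d_P| <= eps for every pair.

    Each pair restricts f to an interval; a match exists when the
    intersection of all intervals contains a positive value.
    """
    lo, hi = 0.0, float("inf")
    for i, j in combinations(range(len(pattern)), 2):
        d_p = dist(pattern[i], pattern[j])
        d_s = dist(candidate[i], candidate[j])
        if d_p == 0.0:
            # Coincident pattern points: f is irrelevant, d_S must be within eps of 0.
            if d_s > eps:
                return False
            continue
        lo = max(lo, (d_s - eps) / d_p)
        hi = min(hi, (d_s + eps) / d_p)
    return lo <= hi and hi > 0.0

pattern = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
doubled = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
print(pure_match(pattern, doubled, 0.01), isotropic_match(pattern, doubled, 0.01))
```

The interval intersection shows why the isotropic case is harder than the pure one: f is not known in advance and can take any positive real value, so a search procedure has to reason about feasible ranges rather than test a fixed set of distances.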
BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments
PeerJ, 6, e5551.
Advances in sequencing techniques have led to exponential growth in
biological data, demanding the development of large-scale bioinformatics
experiments. Because these experiments are computation- and
data-intensive, they require high-performance computing techniques and
can benefit from specialized technologies such as Scientific Workflow
Management Systems and databases. In this work, we present BioWorkbench,
a framework for managing and analyzing bioinformatics experiments. This
framework automatically collects provenance data, including both
performance data from workflow execution and data from the scientific
domain of the workflow application. Provenance data can be analyzed
through a web application that abstracts a set of queries to the
provenance database, simplifying access to provenance information. We
evaluate BioWorkbench using three case studies: SwiftPhylo, a
phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics
workflow; and RASflow, a RASopathy analysis workflow. We analyze each
workflow from both computational and scientific domain perspectives, by
using queries to a provenance and annotation database. Some of these
queries are available as a pre-built feature of the BioWorkbench web
application. Through the provenance data, we show that the framework is
scalable and achieves high performance, reducing the execution time of
the case studies by up to 98%. We also show how the application of machine
learning techniques can enrich the analysis process.
GeNNet: an integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis.
PeerJ, 5, e3509.
There are many steps in analyzing transcriptome data, from the
acquisition of raw data to the selection of a subset of representative
genes that explain a scientific hypothesis. The data produced can be
represented as networks of interactions among genes and these may
additionally be integrated with other biological databases, such as
Protein-Protein Interactions, transcription factors and gene annotation.
However, the results of these analyses remain fragmented, imposing
difficulties, either for posterior inspection of results, or for
meta-analysis by the incorporation of new related data. Integrating
databases and tools into scientific workflows, orchestrating their
execution, and managing the resulting data and its respective metadata
are challenging tasks. Additionally, a great amount of effort is equally
required to run in-silico experiments to structure and compose the
information as needed for analysis. Different programs may need to be
applied and different files are produced during the experiment cycle. In
this context, the availability of a platform supporting experiment
execution is paramount. We present GeNNet, an integrated transcriptome
analysis platform that unifies scientific workflows with graph databases
for selecting relevant genes according to the evaluated biological
systems. It includes GeNNet-Wf, a scientific workflow that pre-loads
biological data, pre-processes raw microarray data and conducts a series
of analyses including normalization, differential expression inference,
clusterization and gene set enrichment analysis. A user-friendly web
interface, GeNNet-Web, allows for setting parameters, executing, and
visualizing the results of GeNNet-Wf executions. To demonstrate the
features of GeNNet, we performed case studies with data retrieved from
GEO, particularly using a single-factor experiment in different analysis
scenarios. As a result, we obtained differentially expressed genes for
which biological functions were analyzed. The results are integrated
into GeNNet-DB, a database about genes, clusters, experiments and their
properties and relationships. The resulting graph database is explored
with queries that demonstrate the expressiveness of this data model for
reasoning about gene interaction networks. GeNNet is the first platform
to integrate the analytical process of transcriptome data with graph
databases. It provides a comprehensive set of tools that would otherwise
be challenging for non-expert users to install and use. Developers can
add new functionality to components of GeNNet. The derived data allows
for testing previous hypotheses about an experiment and exploring new
ones through the interactive graph database environment. It enables the
analysis of different data on humans, rhesus, mice and rats coming from
Affymetrix platforms. GeNNet is available as an open source platform at https://github.com/raquele/GeNNet and can be retrieved as a software container with the command docker pull quelopes/gennet.
MobHet: Predicting Human Mobility Using Heterogeneous Data Sources
Computer Communications, Special issue on Mobile Traffic Analysis, Elsevier Science, ISSN: 0140-3664, vol. 95, pp. 54-68
The literature is rich in mobility models that aim at predicting human mobility. Yet, these models typically consider only a single kind of data source, such as data from mobile calls or location data obtained from GPS and web applications. Thus, the robustness and effectiveness of such data-driven models from the literature remain unknown when using heterogeneous types of data. In contrast, this paper proposes a novel family of data-driven models, called MobHet, to predict human mobility using heterogeneous data sources. Our proposal is designed to use a combination of features capturing the popularity of a region, the frequency of transitions between regions, and the contacts of a user, which can be extracted from data obtained from various sources, both separately and conjointly. We evaluate the MobHet models, comparing them among themselves and with two single-source data-driven models, namely SMOOTH and Leap Graph, while considering different scenarios with single as well as multiple data sources. Our experimental results show that our best MobHet model produces results that are better than or at least comparable to the best baseline in all considered scenarios, unlike the previous models, whose performance is very dependent on the particular type of data used. Our results thus attest to the robustness of our proposed solution to the use of heterogeneous data sources in predicting human mobility.
On MultiAspect Graphs
Theoretical Computer Science (TCS), Elsevier Science, ISSN: 0304-3975, vol. 651, pp. 50-61
Different graph generalizations have been recently used in an ad-hoc manner to represent multilayer networks, i.e. systems formed by distinct layers where each layer can be seen as a network. Similar constructions have also been used to represent time-varying networks. We introduce the concept of MultiAspect Graph (MAG) as a graph generalization that we prove to be isomorphic to a directed graph, and also capable of representing all previous generalizations. In our proposal, the set of vertices, layers, time instants, or any other independent features are considered as an aspect of the MAG. For instance, a MAG is able to represent multilayer or time-varying networks, while both concepts can also be combined to represent a multilayer time-varying network and even other higher-order networks. Since the MAG structure admits an arbitrary (finite) number of aspects, it hence introduces a powerful modeling abstraction for networked complex systems. This paper formalizes the concept of MAG and derives theoretical results useful in the analysis of complex networked systems modeled using the proposed MAG abstraction. We also present an overview of the MAG applicability.
Database System Support of Simulation Data
PVLDB 9(13): 1329-1340 (2016)
Supported by increasingly efficient HPC infrastructure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this paper, we investigate techniques for managing such data using an array DBMS. We take advantage of multidimensional arrays, which nicely model the dimensions and variables used in numerical simulations. However, a naive approach to mapping simulation data files may lead to sparse arrays, impacting query response time, in particular when the simulation uses irregular meshes to model its physical domain. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks with the use of equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments over real-world data, compared them with two other approaches: row-store and column-store DBMS. The results indicate that multidimensional arrays and column-stores are much faster than a traditional row-store system for queries over a larger amount of simulation data. They also help identify the scenarios where array DBMSs are most efficient, and those where they are outperformed by column-stores.
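The idea of using an equi-depth histogram to spread irregular mesh coordinates evenly over array cells can be sketched as follows. This is a one-dimensional Python illustration under simplifying assumptions; it is not the SciDB implementation from the paper and it leaves out the space-filling curve step. The coordinate values and function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def equi_depth_splits(values, n_buckets):
    """Split points chosen so that each bucket holds roughly the same number of values.

    A value x falls into bucket k when split[k-1] <= x < split[k].
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_buckets] for k in range(1, n_buckets)]

def bucket_index(x, splits):
    """Index of the bucket (array cell) that coordinate x is mapped to."""
    return bisect_right(splits, x)

# Hypothetical irregular mesh coordinates: dense near 0, sparse far away.
coords = [0.0, 0.1, 0.2, 0.3, 10.0, 20.0, 30.0, 1000.0]
splits = equi_depth_splits(coords, 4)
cells = Counter(bucket_index(x, splits) for x in coords)
print(splits)
print(sorted(cells.items()))
```

With equal-width cells over the range 0 to 1000, seven of these eight coordinates would land in the first cell and the rest of the array would be almost empty. Choosing split points from the data distribution keeps cell occupancy balanced, which is the property the abstract refers to as evenly distributed cells.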
A Unifying Model for Representing Time-Varying Graphs
IEEE International Conference on Data Science and Advanced Analytics - IEEE DSAA 2015, Paris, France
Graph-based models form a fundamental aspect of data representation in Data Sciences and play a key role in modeling complex networked systems. In particular, there has recently been an ever-increasing interest in modeling dynamic complex networks, i.e. networks in which the topological structure (nodes and edges) may vary over time. In this context, we propose a novel model for representing finite discrete Time-Varying Graphs (TVGs), which are typically used to model dynamic complex networked systems. We analyze the data structures built from our proposed model and demonstrate that, for most practical cases, the asymptotic memory complexity of our model is in the order of the cardinality of the set of edges. Further, we show that our proposal is a unifying model that can represent several previous (classes of) models for dynamic networks found in the recent literature, which in general are unable to represent each other. In contrast to previous models, our proposal is also able to intrinsically model cyclic (i.e. periodic) behavior in dynamic networks. These representation capabilities attest to the expressive power of our proposed unifying model for TVGs. We thus believe our unifying model for TVGs is a step forward in the theoretical foundations for data analysis of complex networked systems.
Managing Scientific Hypotheses as Data with Support for Predictive Analytics
Computing in Science and Engineering 17(5): 35-43 (
The sheer scale of high-resolution raw data generated by simulation has
motivated nonconventional approaches for data exploration, referred to
as immersive and in situ query processing. Another step toward
supporting scientific progress is to enable data-driven hypothesis
management and predictive analytics out of simulation results. The
authors of this article present a synthesis method and tool for encoding
and managing competing hypotheses as uncertain data in a probabilistic
database that can be conditioned in the presence of observations.
BaMBa: towards the integrated management of Brazilian marine environmental data
A new open access database, Brazilian Marine Biodiversity (BaMBa) (https://marinebiodiversity.lncc.br), was developed in order to maintain large datasets from the Brazilian
marine environment. Essentially, any environmental information can be
added to BaMBa. Certified datasets obtained from integrated holistic
studies, comprising physical–chemical parameters, -omics, microbiology,
benthic and fish surveys can be deposited in the new database, enabling
scientific, industrial and governmental policies and actions to be
undertaken on marine resources. There is a significant number of
databases; however, BaMBa is the only integrated database resource both
supported by a government initiative and exclusive to marine data.
BaMBa is linked to the Information System on Brazilian Biodiversity
(SiBBr, http://www.sibbr.gov.br/) and will offer opportunities for improved governance of marine resources and for the integration of scientists.
A Conceptual View on Trajectories
Journal of Data and Knowledge Engineering, pp.126-146, ISSN:0169-023X, V(65)
Analysis of trajectory data is the key to a growing number of applications aiming at global understanding and management of complex phenomena that involve moving objects (e.g. worldwide courier distribution, city traffic management, bird migration monitoring). Current DBMS support for such data is limited to the ability to store and query raw movement (i.e. the spatio-temporal position of an object). This paper explores how conceptual modeling could provide applications with direct support of trajectories (i.e. movement data that is structured into countable semantic units) as a first class concept. A specific concern is to allow enriching trajectories with semantic annotations allowing users to attach semantic data to specific parts of the trajectory. Building on a preliminary requirement analysis and an application example, the paper proposes two modeling approaches, one based on a design pattern, the other based on dedicated data types, and illustrates their differences in terms of implementation in an extended-relational context.