Efficient information diffusion in time-varying graphs through deep reinforcement learning
Machine Learning Approaches to Extreme Weather Events Forecast in Urban Areas: Challenges and Initial Results
Weather forecast services in urban areas face an increasingly hard task of alerting the population about extreme weather events. The hardness of the problem is due to the dynamics of the phenomenon, which challenges numerical weather prediction models and opens an opportunity for Machine Learning (ML) based models that may learn complex input-output mappings from data. In this paper, we present an ongoing research project which aims at building ML predictive models for extreme precipitation forecast in urban areas, in particular in the city of Rio de Janeiro. We present the techniques that we have been developing to improve rainfall prediction and extreme rainfall forecast, along with some initial experimental results. Finally, we discuss some challenges that remain to be tackled in this project.
Emergence and algorithmic information dynamics of systems and observers
Algorithmic Information Distortions in Node-Aligned and Node-Unaligned Multidimensional Networks
In this article, we investigate limitations of importing methods based on algorithmic information theory from monoplex networks into multidimensional networks (such as multilayer networks) that have a large number of extra dimensions (i.e., aspects). In the worst-case scenario, it has been previously shown that node-aligned multidimensional networks with non-uniform multidimensional spaces can display exponentially larger algorithmic information (or lossless compressibility) distortions with respect to their isomorphic monoplex networks, so that these distortions grow at least linearly with the number of extra dimensions. In the present article, we demonstrate that node-unaligned multidimensional networks, either with uniform or non-uniform multidimensional spaces, can also display exponentially larger algorithmic information distortions with respect to their isomorphic monoplex networks. However, unlike the node-aligned non-uniform case studied in previous work, these distortions in the node-unaligned case grow at least exponentially with the number of extra dimensions. On the other hand, for node-aligned multidimensional networks with uniform multidimensional spaces, we demonstrate that any distortion can only grow up to a logarithmic order of the number of extra dimensions. Thus, these results establish that isomorphisms between finite multidimensional networks and finite monoplex networks do not preserve algorithmic information in general and highlight that the algorithmic information of the multidimensional space itself needs to be taken into account in multidimensional network complexity analysis.
SAVIME: An Array DBMS for Simulation Analysis and ML Models Prediction
Limitations in current DBMSs prevent their wide adoption in scientific applications. In order to make scientific applications benefit from DBMS support, enabling declarative data analysis and visualization over scientific data, we present SAVIME, an in-memory array DBMS. In this work we describe the SAVIME system along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system, SciDB, during the process of data ingestion. We also show that it is possible to use SAVIME as a storage alternative for a numerical solver without affecting its scalability, making it useful for modern ML-based applications.
Tactful Networking: Humans in the Communication Loop
This survey discusses the human perspective in networking through the Tactful Networking paradigm, whose goal is to add perceptive senses to the network by endowing it with human-like capabilities of observation, interpretation, and reaction to daily-life features and associated entities. To achieve this, knowledge extracted from inherent human behavior in terms of routines, personality, interactions, and other traits is leveraged, empowering the learning and prediction of user needs to improve QoE and system performance while respecting privacy and fostering new applications and services. Tactful Networking groups solutions from the literature and innovative interdisciplinary human aspects studied in other areas. The paradigm is motivated by mobile devices' pervasiveness and their increasing presence as sensors in our daily social activities. With the human element in the foreground, it is essential: (i) to center big data analytics around individuals; (ii) to create suitable incentive mechanisms for user participation; (iii) to design and evaluate both human-aware and system-aware networking solutions; and (iv) to apply prior and innovative techniques to deal with human-behavior sensing and learning. This survey reviews the human aspect in networking solutions over more than a decade, and then discusses the impact of Tactful Networking through the literature on behavior analysis and representative examples. This paper also discusses a framework comprising data management, analytics, and privacy for enhancing raw human data to assist Tactful Networking solutions. Finally, challenges and opportunities for future research are presented.
STConvS2S: Spatiotemporal Convolutional Sequence to Sequence Network for Weather Forecasting
Applying machine learning models to meteorological data brings many opportunities to the Geosciences field, such as predicting future weather conditions more accurately. In recent years, modeling meteorological data with deep neural networks has become a relevant area of investigation. These works apply either recurrent neural networks (RNN) or some hybrid approach mixing RNN and convolutional neural networks (CNN). In this work, we propose STConvS2S (Spatiotemporal Convolutional Sequence to Sequence Network), a deep learning architecture built for learning both spatial and temporal data dependencies using only convolutional layers. Our proposed architecture resolves two limitations of convolutional networks to predict sequences using historical data: (1) they violate the temporal order during the learning process and (2) they require the lengths of the input and output sequences to be equal. Computational experiments using air temperature and rainfall data from South America show that our architecture captures spatiotemporal context and that it outperforms or matches the results of state-of-the-art architectures for forecasting tasks. In particular, one of the variants of our proposed architecture is 23% better at predicting future sequences and five times faster at training than the RNN-based model used as a baseline.
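The temporal-order issue mentioned above is commonly handled with causal convolutions. The NumPy sketch below is a generic illustration of that idea, not the STConvS2S implementation: left-padding the series makes each output depend only on past and present time steps.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D causal convolution: output at step t depends only on x[0..t].

    Left-padding the series with zeros means the kernel never 'sees'
    future values, preserving temporal order during learning.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # pad on the past side only
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

# A simple moving-average kernel over the last 3 time steps.
series = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
out = causal_conv1d(series, np.array([1/3, 1/3, 1/3]))
```

Note the output has the same length as the input, so input and output sequence lengths need not be coupled by the padding scheme.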
A Survey of Biodiversity Informatics: Concepts, Practices and Challenges
The unprecedented size of the human population, along with its associated economic activities, has an ever-increasing impact on global environments. Across the world, countries are concerned about the growing consumption of resources and the capacity of ecosystems to provide them. To effectively conserve biodiversity, it is essential to make indicators and knowledge openly available to decision-makers in ways that they can effectively use. The development and deployment of mechanisms to produce these indicators depend on having access to trustworthy data from field surveys and automated sensors, biological collections, molecular data, and historic academic literature. The transformation of this raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques used to manage and analyze this data comprise an area often called biodiversity informatics (or e-Biodiversity). Biodiversity data follows a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges.
You Shall not Pass: Avoiding Spurious Paths in Shortest-Path Based Centralities in Multidimensional Complex Networks
In complex network analysis, centralities based on shortest paths, such as betweenness and closeness, are widely used. More recently, many complex systems are being represented by time-varying, multilayer, and time-varying multilayer networks, i.e. multidimensional (or high-order) networks. Nevertheless, it is well known that the aggregation process may create spurious paths in the aggregated view of such multidimensional networks. These spurious paths may in turn cause shortest-path based centrality metrics to produce incorrect results, thus undermining the network centrality analysis. In this context, we propose a method that avoids taking spurious paths into account when computing centralities based on shortest paths in multidimensional (or high-order) networks. Our method is based on MultiAspect Graphs (MAGs) to represent multidimensional networks, and we show that well-known centrality algorithms can be straightforwardly adapted to the MAG environment. Moreover, we show that, by using this MAG representation, pitfalls usually associated with spurious paths resulting from aggregation in multidimensional networks can be avoided at the time of the aggregation process. As a result, shortest-path based centralities are assured to be computed correctly for multidimensional networks, without taking into account spurious paths that could otherwise lead to incorrect results. We also present a case study that shows the impact of spurious paths on the computation of shortest paths, and consequently of shortest-path based centralities, thus illustrating the importance of this contribution.
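The pitfall is easy to reproduce on a toy time-varying network (hypothetical data; this sketch does not use the MAG machinery): edge (b, c) exists only before edge (a, b), so no time-respecting path from a to c exists, yet the time-aggregated graph contains one.

```python
from collections import deque

# Time-stamped directed edges: (u, v, t).  (b, c) happens BEFORE (a, b).
temporal_edges = [("a", "b", 2), ("b", "c", 1)]

def aggregated_has_path(edges, src, dst):
    """BFS on the aggregated graph, ignoring time stamps."""
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {src}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

def time_respecting_path_exists(edges, src, dst):
    """BFS over (node, arrival-time) states: an edge (u, v, t) is usable
    only if we arrived at u no later than time t."""
    seen, queue = {(src, 0)}, deque([(src, 0)])
    while queue:
        u, arrival = queue.popleft()
        if u == dst:
            return True
        for x, v, t in edges:
            if x == u and t >= arrival and (v, t) not in seen:
                seen.add((v, t))
                queue.append((v, t))
    return False

spurious = aggregated_has_path(temporal_edges, "a", "c")      # path exists only after aggregation
real = time_respecting_path_exists(temporal_edges, "a", "c")  # no time-respecting path
```

Any shortest-path centrality computed on the aggregated view would count the a-to-c path even though it can never be traversed in time.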
An analysis of malaria in the Brazilian Legal Amazon using divergent association rules
In data analysis, the mining of frequent patterns plays an important role in the discovery of associations and correlations between data. During this process, it is common to produce thousands of association rules (ARs), making the study of each one arduous. This problem weakens the process of finding useful information. There is a scientific effort to develop approaches capable of filtering interesting patterns, balancing the number of ARs produced against the goal of them not being trivial or already known by specialists. However, even when such approaches are adopted, the number of produced ARs can still be high. This work contributes by presenting the Divergent Association Rules Approach (DARA), a novel approach for obtaining ARs that diverge from the data distribution. DARA is applied right after traditional approaches for filtering interesting patterns. To validate our approach, we studied a dataset related to the occurrence of malaria in the Brazilian Legal Amazon. The discovered patterns highlight that the ARs brought relevant insights from the data. This article contributes to both the medical and computer science fields, since this novel computational approach enabled new findings regarding malaria in Brazil.
CoVeC: Coarse-Grained Vertex Clustering for Efficient Community Detection in Sparse Complex Networks
This paper tackles the problem of community detection in large-scale
graphs. In the literature devoted to this topic, an iterative algorithm,
called Louvain Method (LM), stands out as an effective and fast
solution for this problem. However, the first iterations of the LM are
the most costly. To overcome this issue, this paper introduces CoVeC, a Coarse-grained Vertex Clustering
for efficient community detection in sparse complex networks.
CoVeC pre-processes the original graph in order to forward a graph of
reduced size to the LM. The subsequent group formation, including the
maximization of group quality, as per the modularity metric, is left to
the LM. We evaluate our proposal using real-world and synthetic
networks, presenting distinct sizes and sparsity levels. Overall, our
experimental results show that CoVeC is a much faster option than the
first iterations of the LM, yet similarly effective. In fact, for
sparser graphs, the combination CoVeC+LM outperforms the standalone LM and its
variations, attaining a mean processing time reduction of 47% and a
mean modularity reduction of only 0.4%.
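The general coarsening idea, reducing the graph before handing it to the LM, can be sketched as follows. This naive degree-1 merge is illustrative only; the actual CoVeC heuristic is more elaborate.

```python
from collections import defaultdict

def coarsen(edges):
    """Naively merge each degree-1 vertex into its only neighbor,
    producing a smaller graph to hand off to a community detector
    such as the Louvain Method.  Returns the coarse edge set and the
    vertex-to-representative mapping."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Map every degree-1 vertex to its single neighbor; others map to themselves.
    merge = {u: (next(iter(vs)) if len(vs) == 1 else u) for u, vs in adj.items()}
    coarse = {(merge[u], merge[v]) for u, v in edges if merge[u] != merge[v]}
    return coarse, merge

# A small path graph 0-1, 1-2, 2-3, plus a pendant leaf 4 attached to 1.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
coarse_edges, mapping = coarsen(edges)  # 5 vertices collapse to 2
```

After detection on the coarse graph, the mapping projects each community back onto the original vertices.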
BioinfoPortal: A scientific gateway for integrating bioinformatics applications on the Brazilian national high-performance computing network
Science gateways have gained increasing attention in recent years from
diverse communities. Science gateways are software solutions that enable
the integration of reusable data and specialized techniques via Web
servers while hiding the complexity of the underlying high-performance
computing resources. Several projects and initiatives have been started
worldwide to develop frameworks that support the broad range of key
scientific domains. The biological sciences are undergoing a revolution,
since novel technologies, such as next-generation sequencing, allow data
generation in exascale dimensions. Bioinformatics covers a wide range of
important applications in health, biodiversity, and the life sciences, and
draws on the high-performance computing culture to accelerate
computational simulations of biological systems at all scales. This
article introduces the BioinfoPortal gateway, its architecture and
functionalities, and its integration with the CSGrid
middleware used to manage the high-performance computing environment of
the Brazilian National High-Performance Computing System, SINAPAD,
including the Santos Dumont supercomputer. We discuss the challenges of
integrating BioinfoPortal with the CSGrid framework, covering the general
process of installation, configuration, and deployment. Finally, we
present the findings of a performance analysis of high-performance
computing applications, showing how machine learning was applied to
optimize BioinfoPortal by recommending predictive models for the
efficient allocation of resources, achieving over 75% performance efficiency.
Projection of hospitalization by COVID-19 in Brazil following different social distance policies
Following the first infections in the city of Wuhan, COVID-19 became a global pandemic, as declared by the World Health Organization (WHO). Since it is an airborne disease transmitted between humans, many countries imposed quarantines on their own populations as well as border-closing measures. In Brazil, there is much discussion about the best way to manage the lifting of the quarantine under the constraints of the hospital infrastructure. In this work, we forecast the demand for hospital beds over the next 30 days for every state in Brazil under different quarantine flexibilization scenarios, and analyse how long it would take for the demand to exceed the currently available hospital beds.
SUQ2: Uncertainty Quantification Queries over Large Spatio-temporal Simulations
The combination of high-performance computing approaching Exascale power and numerical techniques enables exploring complex physical phenomena using large-scale spatio-temporal modeling and simulation. Improvements in the fidelity of phenomena simulation require more sophisticated uncertainty quantification analysis, leaving behind measurements restricted to low-order statistical moments and moving towards more expressive probability density function models of uncertainty. In this paper, we consider the problem of answering uncertainty quantification queries over large spatio-temporal simulation results. We propose the SUQ2 method, based on the Generalized Lambda Distribution (GLD) function. GLD fitting is an embarrassingly parallel process over the simulation points that scales linearly with the number of available cores. Furthermore, query answering is entirely based on the computed GLDs and the corresponding clusters, which enables trading the huge amount of simulation output data for just four values of the GLD parametrization per simulation point. The methodology presented in this paper thus becomes an important ingredient for uncertainty analysis as simulations advance towards Exascale computational power.
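As a sketch of what those four values buy (illustrative parameter values, not fitted to any real simulation output), the Ramberg-Schmeiser form of the GLD quantile function answers quantile queries directly from the fitted lambdas:

```python
def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Ramberg-Schmeiser form of the Generalized Lambda Distribution:
    Q(u) = lam1 + (u**lam3 - (1 - u)**lam4) / lam2, for 0 < u < 1.
    Four numbers per simulation point stand in for the raw output samples."""
    return lam1 + (u ** lam3 - (1 - u) ** lam4) / lam2

# Illustrative parameters (NOT from the paper): a symmetric choice,
# so the median equals lam1.
params = (0.0, 1.0, 0.5, 0.5)

median = gld_quantile(0.5, *params)
p95 = gld_quantile(0.95, *params)
```

A quantile query over the whole simulation then reduces to evaluating this closed form at every point, with no pass over the original output data.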
Efficient Network Seeding under Variable Node Cost and Limited Budget for Social Networks
The efficiency of information diffusion on networks highly depends on both the network structure and the set of early spreaders. Moreover, in various realistic scenarios, to seed different nodes implies different costs, as in the case of viral marketing, where costs often correlate with local network structure. The budgeted influence maximization (BIM) problem consists in determining a seed set whose diffusion maximizes the total number of influenced nodes, provided that the seeding cost is within a given budget. We investigate efficient seeding strategies for the BIM problem under the deterministic fixed threshold diffusion model. In particular, we introduce the concept of surrounding sets: relatively cheap seeds neighboring expensive, structurally-privileged nodes, which then become spreaders at lower costs. Numerical experiments with several real networks indicate our method outperforms strategies that seed nodes based on their influence/cost ratios. A key insight from our evaluation is that larger diffusion is generally attained from the surrounding sets that consider the two-hop neighborhood of influential nodes, as opposed to their immediate neighbors only.
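Under the deterministic fixed threshold model described above, diffusion can be simulated with a simple fixed-point iteration. The graph, seeds, and threshold below are hypothetical, chosen to show cheap surrounding nodes activating an expensive hub:

```python
def spread(adj, seeds, threshold=2):
    """Deterministic fixed-threshold diffusion: an inactive node becomes
    active once at least `threshold` of its neighbors are active.
    Iterates to a fixed point and returns the final active set."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node, neighbors in adj.items():
            if node not in active and sum(n in active for n in neighbors) >= threshold:
                active.add(node)
                changed = True
    return active

# A hub (node 0) surrounded by cheap neighbors 1..3; node 4 hangs off the hub.
adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0]}

# Seeding two cheap 'surrounding' nodes activates the hub without paying for it.
result = spread(adj, seeds={1, 3}, threshold=2)
```

Here seeding {1, 3} activates the hub 0 and then node 2, while node 4 (with a single neighbor) never reaches the threshold.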
Parallel computation of PDFs on big spatial data using Spark
We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g. using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create some uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. To analyze uncertainty, the main solution is to compute a Probability Density Function (PDF) of each point in the spatial cube area. However, computing PDFs on big spatial data can be very time-consuming (from several hours to even months on a computer cluster). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction, and sampling. We evaluate our solution by extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (to the order of seconds or minutes) compared with a baseline method.
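A minimal sketch of the per-point PDF computation (plain Python standing in for Spark; the grouping step mirrors the data grouping method named above, while the bin count and sample data are made up):

```python
from collections import defaultdict

def histogram_pdf(values, bins=4, lo=0.0, hi=1.0):
    """Estimate a discrete PDF for one spatial point as a normalized
    histogram of the values observed at that point."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp hi into last bin
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]

# Toy records: (point_id, value) pairs, as if read from many simulation runs.
records = [(0, 0.1), (0, 0.2), (0, 0.9), (1, 0.5), (1, 0.55)]

grouped = defaultdict(list)          # the 'data grouping' step
for pid, v in records:
    grouped[pid].append(v)

pdfs = {pid: histogram_pdf(vs) for pid, vs in grouped.items()}
```

In a Spark deployment the grouping would be a `groupByKey` over point identifiers, with each partition computing its histograms independently, which is what makes the computation parallelize well.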
New perspectives on analysing data from biological collections based on social network analytics
Biological collections have been historically regarded as fundamental
sources of scientific information on biodiversity. They are commonly
associated with a variety of biases, which must be characterized and
mitigated before data can be consumed. In this work, we are motivated by
taxonomic and collector biases, which can be understood as the effect
of particular recording preferences of key collectors on shaping the
overall taxonomic composition of biological collections they contribute
to. In this context, we propose two network models as the first steps
towards a network-based conceptual framework for understanding the
formation of biological collections as a result of the composition of
collectors’ interests and activities. Building upon the defined network
models, we present a case study in which we use our models to explore
the community of collectors and the taxonomic composition of the
University of Brasília herbarium. We describe topological features of
the networks and point out some of the most relevant collectors in the
biological collection as well as their taxonomic groups of interest. We
also investigate their collaborative behaviour while recording
specimens. Finally, we discuss future perspectives for incorporating
temporal and geographical dimensions to the models. Moreover, we
indicate some possible investigation directions that could benefit from
our approach based on social network analytics to model and analyse data from biological collections.
Towards Optimizing the Execution of Spark Scientific Workflows Using Machine Learning based Parameter Tuning
Graph-Based Skill Acquisition for Reinforcement Learning
ACM Computing Surveys (CSUR), ISSN: 0360-0300, vol. 52, issue 1, article no. 6
In machine learning, Reinforcement Learning (RL) is an important tool
for creating intelligent agents that learn solely through experience.
One particular subarea within the RL domain that has received great
attention is how to define macro-actions, which are temporal
abstractions composed of a sequence of primitive actions. This subarea,
loosely called skill acquisition, has been under development for several
years and has led to better results in a diversity of RL problems.
Among the many skill acquisition approaches, graph-based methods have
received considerable attention. This survey presents an overview of
graph-based skill acquisition methods for RL. We cover a diversity of
these approaches and discuss how they evolved throughout the years.
Finally, we also discuss the current challenges and open issues in the
area of graph-based skill acquisition for RL.
Understanding Human Mobility and Workload Dynamics Due To Different Large-Scale Events Using Mobile Phone Data
Journal of Network and Systems Management (JONS), Springer, ISSN: 1064-7570, vol. 26, no. 4, pp. 1079-1100,
The analysis of mobile phone data can help carriers to improve the way they deal with unusual workloads imposed by large-scale events. This paper analyzes human mobility and the resulting dynamics in the network workload caused by three different types of large-scale events: a major soccer match, a rock concert, and a New Year’s Eve celebration, which took place in a large Brazilian city. Our analysis is based on the characterization of records of mobile phone calls made around the time and place of each event. That is, human mobility and network workload are analyzed in terms of the number of mobile phone calls, their inter-arrival and inter-departure times, and their durations. We use heat maps to visually analyze the spatio-temporal dynamics of the movement patterns of the participants of each large-scale event. The results obtained can be helpful to improve the understanding of human mobility caused by large-scale events. Such results could also provide valuable insights for network managers into effective capacity management and planning strategies. We also present PrediTraf, an application built to help cellphone carriers plan their infrastructure for large-scale events.
BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments
PeerJ, 6, e5551.
Advances in sequencing techniques have led to exponential growth in
biological data, demanding the development of large-scale bioinformatics
experiments. Because these experiments are computation- and
data-intensive, they require high-performance computing techniques and
can benefit from specialized technologies such as Scientific Workflow
Management Systems and databases. In this work, we present BioWorkbench,
a framework for managing and analyzing bioinformatics experiments. This
framework automatically collects provenance data, including both
performance data from workflow execution and data from the scientific
domain of the workflow application. Provenance data can be analyzed
through a web application that abstracts a set of queries to the
provenance database, simplifying access to provenance information. We
evaluate BioWorkbench using three case studies: SwiftPhylo, a
phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics
workflow; and RASflow, a RASopathy analysis workflow. We analyze each
workflow from both computational and scientific domain perspectives, by
using queries to a provenance and annotation database. Some of these
queries are available as a pre-built feature of the BioWorkbench web
application. Through the provenance data, we show that the framework is
scalable and achieves high performance, reducing the case studies'
execution time by up to 98%. We also show how the application of machine
learning techniques can enrich the analysis process.
GeNNet: an integrated platform for unifying scientific workflows and graph databases for transcriptome data analysis.
PeerJ, 5, e3509.
There are many steps in analyzing transcriptome data, from the
acquisition of raw data to the selection of a subset of representative
genes that explain a scientific hypothesis. The data produced can be
represented as networks of interactions among genes and these may
additionally be integrated with other biological databases, such as
Protein-Protein Interactions, transcription factors and gene annotation.
However, the results of these analyses remain fragmented, imposing
difficulties, either for posterior inspection of results, or for
meta-analysis by the incorporation of new related data. Integrating
databases and tools into scientific workflows, orchestrating their
execution, and managing the resulting data and its respective metadata
are challenging tasks. Additionally, a great amount of effort is equally
required to run in-silico experiments to structure and compose the
information as needed for analysis. Different programs may need to be
applied and different files are produced during the experiment cycle. In
this context, the availability of a platform supporting experiment
execution is paramount. We present GeNNet, an integrated transcriptome
analysis platform that unifies scientific workflows with graph databases
for selecting relevant genes according to the evaluated biological
systems. It includes GeNNet-Wf, a scientific workflow that pre-loads
biological data, pre-processes raw microarray data and conducts a series
of analyses including normalization, differential expression inference,
clustering, and gene set enrichment analysis. A user-friendly web
interface, GeNNet-Web, allows for setting parameters, executing, and
visualizing the results of GeNNet-Wf executions. To demonstrate the
features of GeNNet, we performed case studies with data retrieved from
GEO, particularly using a single-factor experiment in different analysis
scenarios. As a result, we obtained differentially expressed genes for
which biological functions were analyzed. The results are integrated
into GeNNet-DB, a database about genes, clusters, experiments and their
properties and relationships. The resulting graph database is explored
with queries that demonstrate the expressiveness of this data model for
reasoning about gene interaction networks. GeNNet is the first platform
to integrate the analytical process of transcriptome data with graph
databases. It provides a comprehensive set of tools that would otherwise
be challenging for non-expert users to install and use. Developers can
add new functionality to components of GeNNet. The derived data allows
for testing previous hypotheses about an experiment and exploring new
ones through the interactive graph database environment. It enables the
analysis of different data on humans, rhesus monkeys, mice, and rats from
Affymetrix platforms. GeNNet is available as an open source platform at https://github.com/raquele/GeNNet and can be retrieved as a software container with the command docker pull quelopes/gennet.
MobHet: Predicting Human Mobility Using Heterogeneous Data Sources
Computer Communications, Special issue on Mobile Traffic Analysis, Elsevier Science, ISSN: 0140-3664, vol. 95, pp. 54-68
The literature is rich in mobility models that aim at predicting human mobility. Yet, these models typically consider only a single kind of data source, such as data from mobile calls or location data obtained from GPS and web applications. Thus, the robustness and effectiveness of such data-driven models from the literature remain unknown when using heterogeneous types of data. In contrast, this paper proposes a novel family of data-driven models, called MobHet, to predict human mobility using heterogeneous data sources. Our proposal is designed to use a combination of features capturing the popularity of a region, the frequency of transitions between regions, and the contacts of a user, which can be extracted from data obtained from various sources, both separately and conjointly. We evaluate the MobHet models, comparing them among themselves and with two single-source data-driven models, namely SMOOTH and Leap Graph, while considering different scenarios with single as well as multiple data sources. Our experimental results show that our best MobHet model produces results that are better than or at least comparable to the best baseline in all considered scenarios, unlike the previous models, whose performance is very dependent on the particular type of data used. Our results thus attest to the robustness of our proposed solution to the use of heterogeneous data sources in predicting human mobility.
On MultiAspect Graphs
Theoretical Computer Science (TCS), Elsevier Science, ISSN: 0304-3975, vol. 651, pp. 50-61
Different graph generalizations have been recently used in an ad-hoc manner to represent multilayer networks, i.e. systems formed by distinct layers where each layer can be seen as a network. Similar constructions have also been used to represent time-varying networks. We introduce the concept of MultiAspect Graph (MAG) as a graph generalization that we prove to be isomorphic to a directed graph, and also capable of representing all previous generalizations. In our proposal, the set of vertices, layers, time instants, or any other independent features are considered as an aspect of the MAG. For instance, a MAG is able to represent multilayer or time-varying networks, while both concepts can also be combined to represent a multilayer time-varying network and even other higher-order networks. Since the MAG structure admits an arbitrary (finite) number of aspects, it hence introduces a powerful modeling abstraction for networked complex systems. This paper formalizes the concept of MAG and derives theoretical results useful in the analysis of complex networked systems modeled using the proposed MAG abstraction. We also present an overview of the MAG applicability.
A note on the complexity of the causal ordering problem
In this note we provide a concise report on the complexity of the causal ordering problem, originally introduced by Simon to reason about causal dependencies implicit in systems of mathematical equations. We show that Simon's classical algorithm to infer causal ordering is NP-Hard—an intractability previously guessed but never proven. We then present a detailed account based on Nayak's suggested algorithmic solution (the best available), which is dominated by computing transitive closure—bounded in time by O(|V| · |S|), where S(E, V) is the input system structure composed of a set E of equations over a set V of variables with number of variable appearances (density) |S|. We also comment on the potential of causal ordering for emerging applications in large-scale hypothesis management and analytics.
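The dominating step, transitive closure, can be illustrated with Warshall's algorithm over a small dependency relation (a generic sketch, not Nayak's full procedure):

```python
def transitive_closure(nodes, edges):
    """Warshall's algorithm: reach[u][v] becomes True iff v is reachable
    from u through the given directed edges.  This closure computation is
    the step that dominates the running time discussed above."""
    reach = {u: {v: (u, v) in edges for v in nodes} for u in nodes}
    for k in nodes:
        for i in nodes:
            if reach[i][k]:
                for j in nodes:
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

# Dependencies x -> y -> z among three variables.
nodes = ["x", "y", "z"]
reach = transitive_closure(nodes, {("x", "y"), ("y", "z")})
```

The triple loop makes the cubic cost of closure plain; specialized algorithms improve on it, but closure remains the bottleneck of the overall procedure.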
Managing Scientific Hypotheses as Data with Support for Predictive Analytics
Computing in Science and Engineering 17(5): 35-43
The sheer scale of high-resolution raw data generated by simulation has
motivated nonconventional approaches for data exploration, referred to
as immersive and in situ query processing. Another step toward
supporting scientific progress is to enable data-driven hypothesis
management and predictive analytics out of simulation results. The
authors of this article present a synthesis method and tool for encoding
and managing competing hypotheses as uncertain data in a probabilistic
database that can be conditioned in the presence of observations.
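The core idea of conditioning competing hypotheses in the presence of observations can be sketched in a few lines. The names and schema below are illustrative assumptions, not the tool's API: hypotheses are stored as mutually exclusive alternatives with prior probabilities, and an observation renormalizes probability mass over the hypotheses it does not rule out (Bayes with a 0/1 likelihood):

```python
# Hedged sketch of conditioning hypotheses-as-uncertain-data (illustrative
# names, not the tool's API). Each hypothesis predicts a value for some
# observable quantity.
hypotheses = {
    "H1": {"prior": 0.5, "predicted": 10.0},
    "H2": {"prior": 0.3, "predicted": 12.0},
    "H3": {"prior": 0.2, "predicted": 30.0},
}

def condition(hyps, observed, tolerance):
    """Keep hypotheses whose prediction is within tolerance of the
    observation, then renormalize their priors."""
    surviving = {h: d for h, d in hyps.items()
                 if abs(d["predicted"] - observed) <= tolerance}
    total = sum(d["prior"] for d in surviving.values())
    return {h: d["prior"] / total for h, d in surviving.items()}

posterior = condition(hypotheses, observed=11.0, tolerance=2.0)
print(posterior)  # H1 and H2 survive with renormalized mass; H3 is ruled out
```

A probabilistic database generalizes this by pushing the same conditioning through relational queries rather than a single dictionary.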
BaMBa: towards the integrated management of Brazilian marine environmental data
A new open access database, Brazilian Marine Biodiversity (BaMBa) (https://marinebiodiversity.lncc.br), was developed to maintain large datasets from the Brazilian
marine environment. Essentially, any environmental information can be
added to BaMBa. Certified datasets obtained from integrated holistic
studies, comprising physical–chemical parameters, -omics, microbiology,
benthic and fish surveys can be deposited in the new database, enabling
scientific, industrial and governmental policies and actions to be
undertaken on marine resources. Although many environmental databases
exist, BaMBa is the only integrated database resource that is both
supported by a government initiative and exclusive to marine data.
BaMBa is linked to the Information System on Brazilian Biodiversity
(SiBBr, http://www.sibbr.gov.br/ ) and will offer opportunities for improved governance of marine resources and scientists’ integration.
Y-DB: Managing scientific hypotheses as uncertain data
PVLDB 7(11): 959-962
In view of the paradigm shift that makes science ever more data-driven, we consider deterministic scientific hypotheses as uncertain data. This vision comprises a probabilistic database (p-DB) design methodology for the systematic construction and management of U-relational hypothesis DBs, viz., Υ-DBs. It introduces hypothesis management as a promising new class of applications for p-DBs. We illustrate the potential of Υ-DB as a tool for deep predictive analytics.
Applying Provenance to Protect Attribution in Distributed Computational Scientific Experiments. In Provenance and Annotation of Data and Processes
IPAW 2014. Lecture Notes in Computer Science, Springer.
The automation of large-scale computational scientific experiments can be accomplished with the use of scientific workflow management systems, which allow for the definition of their activities and data dependencies. The manual analysis of the data resulting from their execution is burdensome due to the usually large amount of information. Provenance systems can be used to support this task since they gather details about the design and execution of these experiments. However, provenance information disclosure can also be seen as a threat to correct attribution, if the proper security mechanisms are not in place to protect it. In this article, we address the problem of providing adequate security controls for protecting provenance information, taking into account requirements that are specific to e-Science. Kairos, a provenance security architecture, is proposed to protect both prospective and retrospective provenance, in order to reduce the risk of intellectual property disputes in computational scientific experiments.
MTCProv: a practical provenance query framework for many-task scientific computing.
Distributed and Parallel Databases, 30(5–6), 351–370.
Scientific research is increasingly assisted by computer-based experiments. Such experiments are often composed of a vast number of loosely-coupled computational tasks that are specified and automated as scientific workflows. This large scale is also characteristic of the data that flows within such “many-task” computations (MTC). Provenance information can record the behavior of such computational experiments via the lineage of process and data artifacts. However, work to date has focused on lineage data models, leaving unsolved issues of recording and querying other aspects, such as domain-specific information about the experiments, MTC behavior given by resource consumption and failure information, or the impact of environment on performance and accuracy. In this work we contribute MTCProv, a provenance query framework for many-task scientific computing that captures the runtime execution details of MTC workflow tasks on parallel and distributed systems, in addition to standard prospective and data derivation provenance. To help users query provenance data, we provide a high-level interface that hides relational query complexities. We evaluate MTCProv using an application in protein science, and describe how important query patterns such as correlations between provenance, runtime data, and scientific parameters are simplified and expressed.
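The query pattern highlighted above, correlating scientific parameters with runtime behavior, can be sketched over a toy task log. The schema and field names below are invented for illustration and are not MTCProv's actual model:

```python
# Hypothetical task records combining prospective provenance (a scientific
# parameter), runtime data (CPU seconds), and failure information.
tasks = [
    {"task": "fold-1", "param_temp": 300, "cpu_s": 41.0, "failed": False},
    {"task": "fold-2", "param_temp": 350, "cpu_s": 77.5, "failed": False},
    {"task": "fold-3", "param_temp": 400, "cpu_s": 12.3, "failed": True},
]

def correlate(records, param, metric):
    """Pair a scientific parameter with a runtime metric for successful
    tasks, hiding the underlying join/filter from the user."""
    return [(r[param], r[metric]) for r in records if not r["failed"]]

print(correlate(tasks, "param_temp", "cpu_s"))  # [(300, 41.0), (350, 77.5)]
```

A high-level interface of this kind spares the user from writing the equivalent multi-table relational joins by hand.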
A Conceptual View on Trajectories
Data & Knowledge Engineering, 65(1): 126-146, ISSN 0169-023X
Analysis of trajectory data is the key to a growing number of applications aiming at global understanding and management of complex phenomena that involve moving objects (e.g. worldwide courier distribution, city traffic management, bird migration monitoring). Current DBMS support for such data is limited to the ability to store and query raw movement (i.e. the spatio-temporal position of an object). This paper explores how conceptual modeling could provide applications with direct support of trajectories (i.e. movement data that is structured into countable semantic units) as a first-class concept. A specific concern is to allow enriching trajectories with semantic annotations, allowing users to attach semantic data to specific parts of the trajectory. Building on a preliminary requirement analysis and an application example, the paper proposes two modeling approaches, one based on a design pattern, the other based on dedicated data types, and illustrates their differences in terms of implementation in an extended-relational context.
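The "dedicated data types" approach can be sketched as follows. The type and field names here are assumptions for illustration, not the paper's definitions: a trajectory is structured into countable semantic units (e.g. stops and moves), and each unit can carry user-supplied semantic annotations:

```python
from dataclasses import dataclass, field

# Illustrative trajectory data type: a trajectory is a sequence of semantic
# episodes (stops and moves), each of which can be annotated independently.

@dataclass
class Episode:                      # one semantic unit of movement
    kind: str                       # "stop" or "move"
    t_start: float
    t_end: float
    annotations: dict = field(default_factory=dict)

@dataclass
class Trajectory:
    object_id: str
    episodes: list = field(default_factory=list)

    def annotate(self, index, key, value):
        """Attach semantic data to one specific part of the trajectory."""
        self.episodes[index].annotations[key] = value

truck = Trajectory("courier-42", [
    Episode("move", 0.0, 3.5),
    Episode("stop", 3.5, 4.0),
])
truck.annotate(1, "activity", "delivery")
print(truck.episodes[1].annotations)  # {'activity': 'delivery'}
```

Making trajectories first-class in this way lets queries address semantic units ("stops annotated as deliveries") rather than raw spatio-temporal positions.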