Publications by Year
2022

A Data-Driven Model Selection Approach to Spatio-Temporal Prediction

Rocío Zorrilla, Eduardo Ogasawara, Patrick Valduriez, Fábio Porto

Abstract:

Spatio-temporal predictive queries encompass a spatio-temporal constraint, defining a region, a target variable, and an evaluation metric. The output of such queries presents the future values for the target variable computed by predictive models at each point of the spatio-temporal region. Unfortunately, especially for large spatio-temporal domains with millions of points, training temporal models at each spatial domain point is prohibitive. In this work, we propose a data-driven approach for selecting pre-trained temporal models to be applied at each query point. The chosen approach applies a model to a point according to the training and input time series similarity. The approach avoids training a different model for each domain point, saving model training time. Moreover, it provides a technique to decide on the best-trained model to be applied to a point for prediction. In order to assess the applicability of the proposed strategy, we evaluate a case study for temperature forecasting using historical data and auto-regressive models. Computational experiments show that the proposed approach, compared to the baseline, achieves equivalent predictive performance using a composition of pre-trained models at a fraction of the total computational cost.
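
As a rough illustration of the selection step described above (a minimal sketch under assumptions, not the paper's code): each query point is served by the pre-trained model whose training series is most similar to the point's observed history. The distance measure, normalization, and model interface below are illustrative choices.

```python
import numpy as np

# Minimal sketch of similarity-based model selection; assumes each training
# series is at least as long as the query window.

def znorm(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-9)

def select_model(query_series, models):
    """models: list of (name, training_series, predict_fn) for pre-trained models."""
    q = znorm(query_series)
    return min(models,
               key=lambda m: np.linalg.norm(q - znorm(m[1][-len(q):])))

# usage: name, _, predict = select_model(observed_at_point, pretrained_models)
#        forecast = predict(observed_at_point)
```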

2021

Method for Treating Anomalies in Multivariate Time Series

Thiago Moeda, Mariza Ferro, Eduardo Ogasawara, Fabio Porto

Conference: CILAMCE 2021


Abstract:

In recent years, the classification of time series has gained great relevance in significant sectors and segments of society. Machine learning techniques make it possible to interpret the behavior of anomalous phenomena in multivariate datasets. This work studies three methods from the perspective of their ability to provide relevant information for the detection, validation and prediction of anomalous events in time series data. To achieve this goal, a case study was carried out exploring algorithms based on neural networks and inductive symbolic learning applied to a real problem of detecting anomalies associated with the oil well drilling process. The main results indicate that this method can be a promising way to treat anomalies.

2021

Managing Sparse Spatio-Temporal Data in SAVIME: An Evaluation of the Ph-tree Index

Stiw Herrera, Larissa Miguez da Silva, Paulo Ricardo Reis, Anderson Silva, Fabio Porto

Conference: 36° Simpósio Brasileiro de Banco de Dados (SBBD 2021)

Abstract:

Scientific data is mainly multidimensional in its nature, presenting interesting opportunities for optimizations when managed by array databases. However, in scenarios where data is sparse, an efficient implementation is still required. In this paper, we investigate the adoption of the Ph-tree as an in-memory indexing structure for sparse data. We compare the performance in data ingestion and in both range and punctual queries, using SAVIME as the multidimensional array DBMS. Our experiments, using a real weather dataset, highlight the challenges involved in providing fast data ingestion, as proposed by SAVIME, while efficiently answering multidimensional queries on sparse data.
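
For intuition on the query types evaluated, the sketch below models a sparse multidimensional dataset as a map from coordinate tuples to values, with punctual and range queries answered by a linear scan. It is an illustration only; the Ph-tree replaces that scan with a bit-interleaved trie traversal.

```python
class SparseArray:
    """Toy sparse multidimensional store (illustrative, not SAVIME's structure)."""

    def __init__(self):
        self.cells = {}                      # (i, j, ...) -> value

    def insert(self, coord, value):          # data ingestion
        self.cells[tuple(coord)] = value

    def punctual(self, coord):               # exact-cell lookup
        return self.cells.get(tuple(coord))

    def range_query(self, lo, hi):           # hyper-rectangle scan
        return {c: v for c, v in self.cells.items()
                if all(l <= x <= h for x, l, h in zip(c, lo, hi))}

grid = SparseArray()
grid.insert((10, 42, 3), 21.5)               # e.g. (lat, lon, time) -> temperature
print(grid.range_query((0, 0, 0), (20, 50, 5)))
```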

2021

Requirements for an Ontology of Digital Twins

Claudio Barros, Rebecca Salles, Eduardo Ogasawara, Giancarlo Guizzardi, Fabio Porto

Workshop:
CEUR Workshop Proceedings - First Workshop on Ontology-Driven Conceptual Modelling of Digital Twins (OCDM-DT 2021)


Abstract:

A digital twin connects concrete systems to digital representations, encoding the real world using software systems, tools and models. Therefore, digital twins should comprise abstractions, formal naming and definitions of categories, properties and relations between concepts, data and entities substantiating one, many or every element of some domain of interest. Considering the possible synergies between digital twins and ontology, and the growing demand for connecting the physical and the virtual world through explicit ontological grounding, our work proposes a preliminary discussion of the requirements to build an ontology of digital twins. We outline some relevant topics in both the field of digital twins and ontology that are important for the proposal of a core reference ontology in the field. We also explore these requirements in detail, from the conception and creation of the virtual environment and the digital twins, to the synchronization between the digital world and the real world, in addition to computational services, including visualization, prediction, and prescription. Finally, we present topics for future work.


2021

DJEnsemble: a Cost-Based Selection and Allocation of a Disjoint Ensemble of Spatio-temporal Models

Rafael S. Pereira, Yania Molina Souto, Anderson Silva, Rocio Zorrilla, Brian Tsan, Florin Rusu, Eduardo Ogasawara, Arthur Ziviani, Fabio Porto

Abstract:

Consider a set of black-box models – each of them independently trained on a different dataset – answering the same predictive spatio-temporal query. Being built in isolation, each model traverses its own life-cycle until it is deployed to production, learning data patterns from different datasets and facing independent hyper-parameter tuning. In order to answer the query, the set of black-box predictors has to be ensembled and allocated to the spatio-temporal query region. However, computing an optimal ensemble is a complex task that involves selecting the appropriate models and defining an effective allocation strategy that maps the models to the query region. In this paper we present DJEnsemble, a cost-based strategy for the automatic selection and allocation of a disjoint ensemble of black-box predictors to answer predictive spatio-temporal queries. We conduct a set of extensive experiments that evaluate DJEnsemble and highlight its efficiency, selecting model ensembles that are almost as efficient as the optimal solution. When compared against the traditional ensemble approach, DJEnsemble achieves up to 4X improvement in execution time and almost 9X improvement in prediction accuracy.
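
A simplified picture of the cost-based allocation idea (a sketch under assumptions, not DJEnsemble's actual cost model): tile the query region and assign to each tile the candidate model with the lowest estimated combination of error and execution cost, producing a disjoint ensemble. The estimator functions and weighting are placeholders.

```python
# Illustrative cost-based disjoint allocation of black-box models to tiles.

def allocate(tiles, models, estimate_error, estimate_cost, alpha=0.8):
    """tiles: iterable of tile ids; models: iterable of black-box model ids."""
    allocation = {}
    for tile in tiles:
        allocation[tile] = min(
            models,
            key=lambda m: alpha * estimate_error(m, tile)
                          + (1 - alpha) * estimate_cost(m, tile))
    return allocation   # disjoint ensemble: exactly one model per tile

# usage (with user-supplied estimators):
# plan = allocate(region_tiles, candidate_models, err_fn, cost_fn)
```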

2020

An Algorithmic Information Distortion in Multidimensional Networks

Felipe S. Abrahão, Klaus Wehmuth, Hector Zenil, Artur Ziviani

Abstract:

Network complexity, network information content analysis, and lossless compressibility of graph representations have played an important role in network analysis and network modeling. As multidimensional networks, such as time-varying, multilayer, or dynamic multilayer networks, gain more relevance in network science, it becomes crucial to investigate in which situations universal algorithmic methods based on algorithmic information theory applied to graphs cannot be straightforwardly imported into the multidimensional case. In this direction, as a worst-case scenario of lossless compressibility distortion that increases linearly with the number of distinct dimensions, this article presents a counter-intuitive phenomenon that occurs when dealing with networks within non-uniform and sufficiently large multidimensional spaces. In particular, we demonstrate that the algorithmic information necessary to encode multidimensional networks that are isomorphic to logarithmically compressible monoplex networks may display exponentially larger distortions in the general case.

2019

Machine Learning and Knowledge Graph Inference

Daniel N. R. da Silva, Artur Ziviani, Fabio Porto

Abstract:

The increasing production and availability of massive and heterogeneous data bring forward challenging opportunities. Among them, the development of computing systems capable of learning, reasoning, and inferring facts based on prior knowledge. In this scenario, knowledge bases are valuable assets for the knowledge representation and automated reasoning of diverse application domains. Especially, inference tasks on knowledge graphs (knowledge bases' graphical representations) are increasingly important in academia and industry. In this short course, we introduce machine learning methods and techniques employed in knowledge graph inference tasks as well as discuss the technical and scientific challenges and opportunities associated with those tasks.

2019

SAVIME: A Database Management System for Simulation Data Analysis and Visualization

Hermano Lustosa, Fabio Porto, Patrick Valduriez

Abstract:

Limitations in current DBMSs prevent their wide adoption in scientific applications. In order to make scientific applications benefit from DBMS support, enabling declarative data analysis and visualization over scientific data, we present an in-memory array DBMS called SAVIME. In this work we describe the SAVIME system, along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system, SciDB, during the process of data ingestion. We also show that it is possible to use SAVIME as a storage alternative for a numerical solver without affecting its scalability.

2019

A conceptual vision toward the management of Machine Learning models

Daniel N. R. da Silva, Yania Souto, Adolfo Simões, Carlos Cardoso, João N. Rittmeyer, Hermano Lustosa, Luciana E. G. Vignoli, Rebecca Salles, Eduardo Ogasawara, Flavia C. Delicato, Paulo de F. Pires, Artur Ziviani and Fabio Porto

Abstract:

To turn big data into actionable knowledge, the adoption of machine learning (ML) methods has proven to be one of the de facto approaches.
When elaborating an appropriate ML model for a given task, one typically builds many models and generates several data artifacts.
Given the amount of information associated with the developed models' performance, their appropriate selection is often difficult. Therefore, appropriately comparing a set of competitive ML models and choosing one according to an arbitrary set of user metrics require systematic solutions.
In particular, ML model management is a promising research direction for a more systematic and comprehensive approach for machine learning model selection. Therefore, in this paper, we introduce a conceptual model for ML development. Based on this conceptualization, we introduce our vision toward a knowledge-based model management system oriented to model selection.
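
As a toy illustration of metric-based model comparison in the spirit of this vision (the registry entries, metric names, and weights are made up, not the paper's system): rank registered models by a weighted combination of user-chosen metrics stored as model metadata.

```python
# Illustrative selection among competing ML models by user-defined metrics.

registry = [
    {"model": "m1", "accuracy": 0.91, "latency_ms": 40.0},
    {"model": "m2", "accuracy": 0.89, "latency_ms": 12.0},
]

def score(entry, weights):
    # higher accuracy is better, lower latency is better
    return weights["accuracy"] * entry["accuracy"] \
         - weights["latency_ms"] * entry["latency_ms"]

weights = {"accuracy": 1.0, "latency_ms": 0.01}
best = max(registry, key=lambda e: score(e, weights))
print(best["model"])        # m2: slightly less accurate but much faster
```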

2019

Deep Learning Application for Plant Classification on Unbalanced Training Set

Rafael Silva Pereira, Fábio Porto

Abstract:

Deep learning models expect a reasonable amount of training instances to improve prediction quality. Moreover, in classification problems, the occurrence of an unbalanced distribution may lead to a biased model. In this paper, we investigate the problem of species classification from plant images, where some species have very few image samples. We explore reduced versions of ImageNet neural network winner architectures to filter the space of candidate matches under a target accuracy level. We show through experimental results using real unbalanced plant image datasets that our approach can lead to classifications within the 5 best positions with high probability.
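
The sketch below illustrates, with made-up numbers, two ingredients the abstract touches on: compensating an unbalanced training set with inverse-frequency class weights, and reporting the top-5 candidate species rather than a single label. It is not the paper's training pipeline.

```python
import numpy as np

counts = np.array([500, 120, 30, 5])            # images per species (unbalanced)
class_weights = counts.sum() / (len(counts) * counts)   # inverse-frequency weights

probs = np.array([0.05, 0.15, 0.20, 0.60])      # softmax output for one image
top5 = np.argsort(probs)[::-1][:5]              # candidate species, best first
print(class_weights)
print(top5)                                     # accept a hit anywhere in the top 5
```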

2019

Dealing with categorical missing data using CleanerR

Rafael Silva Pereira, Fábio Porto

Abstract:

Missing data is a common problem in data analysis. Missing values appear in datasets due to a multitude of reasons, from data integration to poor data input. When faced with the problem, the analyst must decide what to do with the missing data, since it is not always advisable to discard these values from the analysis. In this paper, we discuss a method that takes into account information theory and functional dependencies to best impute missing values.
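
To illustrate the functional-dependency idea (this mimics the concept only, not the CleanerR package API): if an assumed dependency city -> state holds, missing states can be filled from rows sharing the same city, falling back to the overall mode.

```python
import pandas as pd

df = pd.DataFrame({"city":  ["Rio", "Rio", "Niteroi", "Niteroi"],
                   "state": ["RJ",  None,  "RJ",      None]})

fallback = df["state"].mode().iloc[0]
df["state"] = (df.groupby("city")["state"]
                 .transform(lambda s: s.fillna(s.dropna().mode().iloc[0]
                                               if not s.dropna().empty else fallback)))
print(df)   # both missing states imputed as "RJ" via the city -> state dependency
```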


2019

SDN-Based Architecture for Providing Quality of Service to High Performance Distributed Applications

Alexandre T. Oliveira, Bruno J. C. A. Martins, Marcelo F. Moreno, Antônio Tadeu A. Gomes, Artur Ziviani, Alex Borges

Abstract:

The specification of quality of service (QoS) requirements in most existing networks is still challenging. In part, traditional network environments are limited by their high administrative cost, whereas software-defined networks (SDNs), a newer network paradigm, simplify the management of the whole network infrastructure. In fact, SDN provides a simple way to effectively develop QoS provisioning mechanisms. In this sense, we explore the SDN model and its flexibility to develop a QoS provisioning architecture. Through the use of our new architecture, network operators are able to specify QoS levels in a simple way. Each individual data flow can be addressed, and the architecture we propose also negotiates the QoS requirements between the network controller and applications. In addition, the network controller continuously monitors the network environment; it then allocates network element resources and prioritizes traffic, adjusting the network performance. We evaluate the feasibility of our QoS provisioning mechanism by presenting three experimental setups under realistic scenarios. For example, for a given scenario where we evaluate file transfers, our results indicate that the additional SDN modules present negligible overhead. Moreover, for a given setup, we observe a reduction of up to 82% in the file transfer times.
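
As a purely illustrative stand-in for the per-flow negotiation described above (names and thresholds are assumptions, not the architecture's API): an application-side QoS request is mapped to a traffic priority class that the controller could enforce.

```python
from dataclasses import dataclass

@dataclass
class QoSRequest:
    flow_id: str
    min_bandwidth_mbps: float
    max_latency_ms: float

def priority_class(req: QoSRequest) -> int:
    """Toy mapping from a flow's QoS request to a DSCP-style priority."""
    if req.max_latency_ms <= 10:
        return 46        # expedited-forwarding-like class for latency-sensitive flows
    if req.min_bandwidth_mbps >= 100:
        return 34        # assured-forwarding-like class for high-throughput flows
    return 0             # best effort

print(priority_class(QoSRequest("flow-1", 200, 50)))   # 34
```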

2018

Constellation Queries over Big Data

Fabio Porto, Amir Khatibi, Joao N. Rittmeyer, Eduardo Ogasawara, Patrick Valduriez, Dennis Shasha

Abstract:


A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. Finding geometric patterns is a challenging problem as the potential number of sets of elements that compose shapes is exponentially large in the size of the dataset and the pattern. In this paper, we propose algorithms to find patterns in large data applications. Our methods combine quadtrees, matrix multiplication, and bucket join processing to discover sets of points that match a geometric pattern within some additive factor on the pairwise distances. Our distributed experiments show that the choice of composition algorithm (matrix multiplication or nested loops) depends on the freedom introduced in the query geometry through the distance additive factor. Three clearly identified blocks of threshold values guide the choice of the best composition algorithm.
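
For reference, a brute-force version of the matching condition (exponential, for illustration only; the paper's contribution is the quadtree, matrix-multiplication, and bucket-join algorithms that avoid this): a candidate set matches when every pairwise distance agrees with the pattern's within the additive factor.

```python
from itertools import permutations, combinations
from math import dist

def matches(candidate, pattern, eps):
    return all(abs(dist(candidate[i], candidate[j]) - dist(pattern[i], pattern[j])) <= eps
               for i, j in combinations(range(len(pattern)), 2))

def constellation(dataset, pattern, eps):
    for subset in combinations(dataset, len(pattern)):
        for ordering in permutations(subset):
            if matches(ordering, pattern, eps):
                yield ordering

pattern = [(0, 0), (1, 0), (0, 1)]
data = [(10, 10), (11, 10), (10, 11), (50, 50)]
print(next(constellation(data, pattern, eps=0.05)))   # the translated triangle
```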

2018

Point pattern search in big data

Fabio Porto, João N. Rittmeyer, Eduardo Ogasawara, Alberto Krone-Martins, Patrick Valduriez, Dennis Shasha

SSDBM 2018: 21:1-21:12

Abstract:

Consider a set of points P in space with at least some of the pairwise distances specified. Given this set P, consider the following three kinds of queries against a database D of points: (i) pure constellation query: find all sets S in D of size |P| that exactly match the pairwise distances within P up to an additive error ϵ; (ii) isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f for which the distances between pairs in S exactly match f times the distances between corresponding pairs of P up to an additive ϵ; (iii) non-isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f and, for at least some pairs of points, a maximum stretch factor m_{i,j} > 1 such that (f · m_{i,j} · dist(p_i, p_j)) + ϵ > dist(s_i, s_j) > (f · dist(p_i, p_j)) − ϵ. Finding matches to such queries has applications to spatial data in astronomical, seismic, and any domain in which (approximate, scale-independent) geometrical matching is required. Answering the isotropic and non-isotropic queries is challenging because scale factors and stretch factors may take any of an infinite number of values. This paper proposes practically efficient sequential and distributed algorithms for pure, isotropic, and non-isotropic constellation queries. As far as we know, this is the first work to address isotropic and non-isotropic queries.
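
A direct transcription of the isotropic membership test from the query definition (a sketch only; the scale factor is estimated from the first pair, whereas the paper's algorithms avoid such brute-force checks): a candidate set S matches pattern P if one scale factor f makes every pairwise distance agree up to the additive ϵ.

```python
from itertools import combinations
from math import dist

def isotropic_match(S, P, eps):
    pairs = list(combinations(range(len(P)), 2))
    i0, j0 = pairs[0]
    f = dist(S[i0], S[j0]) / dist(P[i0], P[j0])          # candidate scale factor
    return all(abs(dist(S[i], S[j]) - f * dist(P[i], P[j])) <= eps
               for i, j in pairs)

P = [(0, 0), (1, 0), (0, 1)]
S = [(0, 0), (3, 0), (0, 3)]                              # P scaled by f = 3
print(isotropic_match(S, P, eps=1e-9))                    # True
```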


2017

TARS: An Array Model with Rich Semantics for Multidimensional Data

Hermano Lustosa, Noel Lemus, Fabio Porto, Patrick Valduriez

ER Forum/Demos 2017: 114-127


Abstract:

Relational DBMSs have been shown to be inefficient for scientific data management. One main reason is the difficulty to represent arrays, which are frequently adopted as a data model for scientific dataset representation. Array DBMSs, e.g., SciDB, were proposed to bridge this gap, building on a native array representation. Unfortunately, important scientific applications, such as numerical simulation, have additional requirements, in particular to deal with mesh topology and geometry. First, transforming simulation result datasets into DBMS array format incurs huge latency due to the fixed format of array DBMS layouts and the data transformations needed to adapt to mesh data characteristics. Second, simulation applications require data visualization or computing uncertainty quantification (UQ), both requiring metadata beyond the simulation output array. To address these problems, we propose a novel data model called TARS (Typed ARray Schema), which extends the basic array data model with typed arrays. In TARS, the support of application-dependent data characteristics, such as data visualization and UQ computation, is provided through the definition of TAR objects, ready to be manipulated by TAR operators. This approach provides much flexibility for capturing internal data layouts through mapping functions, which makes data ingestion independent of how simulation data has been produced, thus minimizing ingestion time. In this paper, we present the TARS data model and illustrate its use in the context of a numerical simulation application.
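
A conceptual sketch of the typed-array idea (field names are illustrative, not the actual TARS schema): an array bundled with a mapping function that translates simulation mesh coordinates into array indices, so ingestion does not depend on how the solver laid out its output.

```python
from dataclasses import dataclass
from typing import Callable, Tuple
import numpy as np

@dataclass
class TypedArray:
    data: np.ndarray
    mapping: Callable[[Tuple[float, ...]], Tuple[int, ...]]  # mesh coord -> cell index

    def value_at(self, coord):
        return self.data[self.mapping(coord)]

# usage with a trivial regular-grid mapping (dx = dy = 0.5, assumed):
grid = TypedArray(np.arange(16.0).reshape(4, 4),
                  mapping=lambda c: (int(c[0] / 0.5), int(c[1] / 0.5)))
print(grid.value_at((1.0, 1.5)))    # value stored at cell (2, 3)
```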

2016

Database System Support of Simulation Data

Hermano Lustosa, Fabio Porto, Pablo Blanco, Patrick Valduriez

PVLDB 9(13): 1329-1340 (2016)

Abstract:

Supported by increasingly efficient HPC infrastructure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this paper, we investigate techniques for managing such data using an array DBMS. We take advantage of multidimensional arrays that nicely model the dimensions and variables used in numerical simulations. However, a naive approach to map simulation data files may lead to sparse arrays, impacting query response time, in particular when the simulation uses irregular meshes to model its physical domain. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks with the use of equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments over real-world data, compared them with two other approaches: row-store and column-store DBMS. The results indicate that multidimensional arrays and column-stores are much faster than a traditional row-store system for queries over a larger amount of simulation data. They also help identify the scenarios where array DBMSs are most efficient, and those where they are outperformed by column-stores.
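
The equi-depth idea can be pictured with a one-dimensional toy example (parameters made up, not the paper's implementation): binning irregular coordinates by quantiles gives each chunk roughly the same number of cells, unlike equal-width bins that leave sparse chunks under skew.

```python
import numpy as np

coords = np.sort(np.random.lognormal(size=10_000))        # skewed 1-D coordinates
n_chunks = 8
edges = np.quantile(coords, np.linspace(0, 1, n_chunks + 1))   # equi-depth edges
chunk_id = np.clip(np.searchsorted(edges, coords, side="right") - 1,
                   0, n_chunks - 1)
print(np.bincount(chunk_id))        # ~1250 cells per chunk, despite the skew
```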

2015

A Unifying Model for Representing Time-Varying Graphs

Klaus Wehmuth, Artur Ziviani, Eric Fleury

IEEE International Conference on Data Science and Advanced Analytics - IEEE DSAA 2015, Paris, France

Abstract:

Graph-based models form a fundamental aspect of data representation in Data Sciences and play a key role in modeling complex networked systems. In particular, there is recently an ever-increasing interest in modeling dynamic complex networks, i.e. networks in which the topological structure (nodes and edges) may vary over time. In this context, we propose a novel model for representing finite discrete Time-Varying Graphs (TVGs), which are typically used to model dynamic complex networked systems. We analyze the data structures built from our proposed model and demonstrate that, for most practical cases, the asymptotic memory complexity of our model is in the order of the cardinality of the set of edges. Further, we show that our proposal is a unifying model that can represent several previous (classes of) models for dynamic networks found in the recent literature, which in general are unable to represent each other. In contrast to previous models, our proposal is also able to intrinsically model cyclic (i.e. periodic) behavior in dynamic networks. These representation capabilities attest to the expressive power of our proposed unifying model for TVGs. We thus believe our unifying model for TVGs is a step forward in the theoretical foundations for data analysis of complex networked systems.
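
A minimal sketch of the kind of object such a model describes (the field names are illustrative, not the paper's formal notation): a finite discrete time-varying graph stored as edges between (node, time instant) pairs, so topology and its evolution live in one edge set.

```python
tvg_edges = {
    (("a", 1), ("b", 1)),   # spatial edge: a-b exists at time 1
    (("b", 1), ("b", 2)),   # temporal edge: node b persists from t=1 to t=2
    (("a", 2), ("c", 2)),   # a-c appears only at time 2
}

def snapshot(edges, t):
    """Static graph observed at time instant t (spatial edges only)."""
    return {(u, v) for (u, tu), (v, tv) in edges if tu == tv == t}

print(snapshot(tvg_edges, 1))   # {('a', 'b')}
print(snapshot(tvg_edges, 2))   # {('a', 'c')}
```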