Year Publication
2020

An Algorithmic Information Distortion in Multidimensional Networks

Felipe S. Abrahão, Klaus Wehmuth, Hector Zenil, Artur Ziviani

Abstract:

Network complexity, network information content analysis, and lossless compressibility of graph representations have played an important role in network analysis and network modeling. As multidimensional networks, such as time-varying, multilayer, or dynamic multilayer networks, gain relevance in network science, it becomes crucial to investigate in which situations universal algorithmic methods based on algorithmic information theory applied to graphs cannot be straightforwardly imported into the multidimensional case. In this direction, as a worst-case scenario of lossless compressibility distortion that increases linearly with the number of distinct dimensions, this article presents a counter-intuitive phenomenon that occurs when dealing with networks within non-uniform and sufficiently large multidimensional spaces. In particular, we demonstrate that the algorithmic information necessary to encode multidimensional networks that are isomorphic to logarithmically compressible monoplex networks may display exponentially larger distortions in the general case.

2019

Machine Learning and Knowledge Graph Inference

Daniel N. R. da Silva, Artur Ziviani, Fabio Porto

Abstract:

The increasing production and availability of massive and heterogeneous data bring forward challenging opportunities. Among them is the development of computing systems capable of learning, reasoning, and inferring facts based on prior knowledge. In this scenario, knowledge bases are valuable assets for the knowledge representation and automated reasoning of diverse application domains. In particular, inference tasks on knowledge graphs (knowledge bases' graphical representations) are increasingly important in academia and industry. In this short course, we introduce machine learning methods and techniques employed in knowledge graph inference tasks and discuss the technical and scientific challenges and opportunities associated with those tasks.

2019

SAVIME: A Database Management System for Simulation Data Analysis and Visualization

Hermano Lustosa, Fabio Porto, Patrick Valduriez

Abstract:

Limitations in current DBMSs prevent their wide adoption in scientific applications. To let scientific applications benefit from DBMS support, enabling declarative data analysis and visualization over scientific data, we present an in-memory array DBMS called SAVIME. In this work, we describe SAVIME along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system, SciDB, during data ingestion. We also show that it is possible to use SAVIME as a storage alternative for a numerical solver without affecting its scalability.

2019

A conceptual vision toward the management of Machine Learning models

Daniel N. R. da Silva, Yania Souto, Adolfo Simões, Carlos Cardoso, João N. Rittmeyer, Hermano Lustosa, Luciana E. G. Vignoli, Rebecca Salles, Eduardo Ogasawara, Flavia C. Delicato, Paulo de F. Pires, Artur Ziviani and Fabio Porto

Abstract:

To turn big data into actionable knowledge, the adoption of machine learning (ML) methods has proven to be one of the de facto approaches.
When elaborating an appropriate ML model for a given task, one typically builds many models and generates several data artifacts.
Given the amount of information associated with the developed models' performance, their appropriate selection is often difficult. Therefore, appropriately comparing a set of competitive ML models and choosing one according to an arbitrary set of user metrics requires systematic solutions.
In particular, ML model management is a promising research direction for a more systematic and comprehensive approach to machine learning model selection. In this paper, we introduce a conceptual model for ML development. Based on this conceptualization, we introduce our vision toward a knowledge-based model management system oriented to model selection.

2019

Deep Learning Application for Plant Classification on Unbalanced Training Set

Rafael Silva Pereira, Fábio Porto

Abstract:

Deep learning models expect a reasonable amount of training instances to improve prediction quality. Moreover, in classification problems, the occurrence of an unbalanced distribution may lead to a biased model. In this paper, we investigate the problem of species classification from plant images, where some species have very few image samples. We explore reduced versions of ImageNet-winning neural network architectures to filter the space of candidate matches, under a target accuracy level. We show through experimental results using real unbalanced plant image datasets that our approach can lead to classifications within the 5 best positions with high probability.

2019

Dealing with categorical missing data using CleanerR

Rafael Silva Pereira, Fábio Porto

Abstract:

Missing data is a common problem in data analysis. It appears in datasets for a multitude of reasons, from data integration to poor data input. When faced with the problem, the analyst must decide what to do with the missing data, since it is not always advisable to discard these values from the analysis. In this paper, we discuss a method that takes into account information theory and functional dependencies to best impute missing values.
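As a rough illustration of how information theory and (approximate) functional dependencies can guide categorical imputation, the sketch below picks the candidate column with the lowest conditional entropy H(target | column) and fills each missing value with the most frequent target value seen for that column's value. This is not the CleanerR implementation; the function names, the row-as-dictionary layout, and the use of `None` for missing values are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(rows, target, given):
    """Estimate H(target | given) in bits from rows where both values are present."""
    groups = defaultdict(Counter)
    for row in rows:
        if row[target] is not None and row[given] is not None:
            groups[row[given]][row[target]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return float("inf")  # no evidence: treat the column as uninformative
    entropy = 0.0
    for counts in groups.values():
        group_size = sum(counts.values())
        for k in counts.values():
            entropy -= (k / total) * log2(k / group_size)
    return entropy


def impute(rows, target, candidates):
    """Fill missing `target` values using the candidate column that best determines it.

    Returns the chosen column and a new list of rows; the input is not modified.
    """
    # An entropy of 0 means the column functionally determines the target
    # on the observed rows; lower is better.
    best = min(candidates, key=lambda col: conditional_entropy(rows, target, col))
    modes = defaultdict(Counter)
    for row in rows:
        if row[target] is not None and row[best] is not None:
            modes[row[best]][row[target]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None and new_row[best] in modes:
            new_row[target] = modes[new_row[best]].most_common(1)[0][0]
        filled.append(new_row)
    return best, filled
```

Values whose determinant was never observed together with the target are left missing rather than guessed.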


2019

SDN-Based Architecture for Providing Quality of Service to High Performance Distributed Applications

Alexandre T. Oliveira, Bruno J. C. A. Martins, Marcelo F. Moreno, Antônio Tadeu A. Gomes, Artur Ziviani, Alex Borges

Abstract:

The specification of quality of service (QoS) requirements in most existing networks is still challenging. In part, traditional network environments are limited by their high administrative cost, whereas software-defined networks (SDNs), a newer network paradigm, simplify the management of the whole network infrastructure. In fact, SDN provides a simple way to effectively develop QoS provisioning mechanisms. In this sense, we explore the SDN model and its flexibility to develop a QoS provisioning architecture. Through the use of our new architecture, network operators are able to specify QoS levels in a simple way. Each individual data flow can be addressed, and the architecture we propose also negotiates the QoS requirements between the network controller and applications. Meanwhile, the network controller continuously monitors the network environment. It then allocates network element resources and prioritizes traffic, adjusting the network performance to guarantee the QoS accordingly. We evaluate the feasibility of our QoS provisioning mechanism by presenting three experimental setups under realistic scenarios. For example, for a given scenario where we evaluate file transfers, our results indicate that the additional SDN modules present negligible overhead. Moreover, for a given setup, we observe a reduction of up to 82% in the file transfer times.

2018

Constellation Queries over Big Data

Fabio Porto, Amir Khatibi, Joao N. Rittmeyer, Eduardo Ogasawara, Patrick Valduriez, Dennis Shasha

Abstract:


A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. Finding geometric patterns is a challenging problem as the potential number of sets of elements that compose shapes is exponentially large in the size of the dataset and the pattern. In this paper, we propose algorithms to find patterns in large data applications. Our methods combine quadtrees, matrix multiplication, and bucket join processing to discover sets of points that match a geometric pattern within some additive factor on the pairwise distances. Our distributed experiments show that the choice of composition algorithm (matrix multiplication or nested loops) depends on the freedom introduced in the query geometry through the distance additive factor. Three clearly identified blocks of threshold values guide the choice of the best composition algorithm.

2018

Point pattern search in big data

Fabio Porto, João N. Rittmeyer, Eduardo Ogasawara, Alberto Krone-Martins, Patrick Valduriez, Dennis Shasha

SSDBM 2018: 21:1-21:12

Abstract:

Consider a set of points P in space with at least some of the pairwise distances specified. Given this set P, consider the following three kinds of queries against a database D of points: (i) pure constellation queries: find all sets S in D of size |P| that exactly match the pairwise distances within P up to an additive error ϵ; (ii) isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f for which the distances between pairs in S exactly match f times the distances between corresponding pairs of P up to an additive error ϵ; (iii) non-isotropic constellation queries: find all sets S in D of size |P| such that there exists some scale factor f and, for at least some pairs of points, a maximum stretch factor m_{i,j} > 1 such that (f × m_{i,j} × dist(p_i, p_j)) + ϵ > dist(s_i, s_j) > (f × dist(p_i, p_j)) − ϵ. Finding matches to such queries has applications to spatial data in astronomy, seismology, and any domain in which (approximate, scale-independent) geometrical matching is required. Answering the isotropic and non-isotropic queries is challenging because scale factors and stretch factors may take any of an infinite number of values. This paper proposes practically efficient sequential and distributed algorithms for pure, isotropic, and non-isotropic constellation queries. As far as we know, this is the first work to address isotropic and non-isotropic queries.
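To make the definition of a pure constellation query concrete, here is a brute-force sketch that checks every ordered assignment of database points to pattern points. It is only a reference definition with exponential cost, not the efficient algorithms the paper proposes. The function name is hypothetical, all pairwise distances of the pattern are assumed to be specified, and the tolerance is applied as a non-strict bound.

```python
from itertools import permutations
from math import dist


def pure_constellation_matches(pattern, database, eps):
    """Return index tuples s with |dist(s_i, s_j) - dist(p_i, p_j)| <= eps for all i < j.

    Each tuple is ordered: its i-th entry is the database index matched to pattern[i].
    """
    k = len(pattern)
    # Pairwise distances of the query pattern, computed once.
    target = {
        (i, j): dist(pattern[i], pattern[j])
        for i in range(k)
        for j in range(i + 1, k)
    }
    matches = []
    # Exhaustive search over ordered k-tuples of distinct database points.
    for combo in permutations(range(len(database)), k):
        if all(
            abs(dist(database[combo[i]], database[combo[j]]) - d) <= eps
            for (i, j), d in target.items()
        ):
            matches.append(combo)
    return matches
```

Because matches are returned as ordered correspondences, a symmetric pattern yields the same point set more than once; deduplicating with `frozenset(combo)` recovers the set-based formulation used in the abstract.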


2017

TARS: An Array Model with Rich Semantics for Multidimensional Data

Hermano Lustosa, Noel Lemus, Fabio Porto, Patrick Valduriez

ER Forum/Demos 2017: 114-127


Abstract:

Relational DBMSs have been shown to be inefficient for scientific data management. One main reason is the difficulty of representing arrays, which are frequently adopted as a data model for representing scientific datasets. Array DBMSs, e.g. SciDB, were proposed to bridge this gap, building on a native array representation. Unfortunately, important scientific applications, such as numerical simulation, have additional requirements, in particular to deal with mesh topology and geometry. First, transforming simulation result datasets into a DBMS array format incurs huge latency due to the fixed format of array DBMS layouts and the data transformations needed to adapt to mesh data characteristics. Second, simulation applications require data visualization or uncertainty quantification (UQ) computation, both requiring metadata beyond the simulation output array. To address these problems, we propose a novel data model called TARS (Typed ARray Schema), which extends the basic array data model with typed arrays. In TARS, the support of application-dependent data characteristics, such as data visualization and UQ computation, is provided through the definition of TAR objects, ready to be manipulated by TAR operators. This approach provides much flexibility for capturing internal data layouts through mapping functions, which makes data ingestion independent of how simulation data has been produced, thus minimizing ingestion time. In this paper, we present the TARS data model and illustrate its use in the context of numerical simulation applications.

2016

Database System Support of Simulation Data

Hermano Lustosa, Fabio Porto, Pablo Blanco, Patrick Valduriez

PVLDB 9(13): 1329-1340 (2016)

Abstract:

Supported by increasingly efficient HPC infrastructure, numerical simulations are rapidly expanding to fields such as oil and gas, medicine, and meteorology. As simulations become more precise and cover longer periods of time, they may produce files with terabytes of data that need to be efficiently analyzed. In this paper, we investigate techniques for managing such data using an array DBMS. We take advantage of multidimensional arrays, which nicely model the dimensions and variables used in numerical simulations. However, a naive approach to mapping simulation data files may lead to sparse arrays, impacting query response time, in particular when the simulation uses irregular meshes to model its physical domain. We propose efficient techniques to map coordinate values in numerical simulations to evenly distributed cells in array chunks with the use of equi-depth histograms and space-filling curves. We implemented our techniques in SciDB and, through experiments over real-world data, compared them with two other approaches: row-store and column-store DBMSs. The results indicate that multidimensional arrays and column-stores are much faster than a traditional row-store system for queries over a larger amount of simulation data. They also help identify the scenarios where array DBMSs are most efficient, and those where they are outperformed by column-stores.
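The idea behind an equi-depth histogram can be sketched in a few lines: bucket boundaries are taken at evenly spaced ranks of the sorted coordinates, so that irregularly spaced mesh coordinates land in cells holding roughly the same number of points. The snippet below is a one-dimensional illustration only; the function names are hypothetical, it does not cover the space-filling-curve part, and it is not the SciDB implementation described in the paper.

```python
from bisect import bisect_right


def equi_depth_boundaries(values, buckets):
    """Return buckets - 1 split points so each bucket holds about len(values) / buckets items."""
    ordered = sorted(values)
    n = len(ordered)
    # Take the value found at every (n / buckets)-th rank as a boundary.
    return [ordered[(i * n) // buckets] for i in range(1, buckets)]


def to_cell(value, boundaries):
    """Map a coordinate to its bucket index in the range 0 .. len(boundaries)."""
    return bisect_right(boundaries, value)
```

With heavily repeated coordinate values the buckets can no longer be balanced exactly, since equal values always fall into the same cell.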

2015

A Unifying Model for Representing Time-Varying Graphs

Klaus Wehmuth, Artur Ziviani, Eric Fleury

IEEE International Conference on Data Science and Advanced Analytics - IEEE DSAA 2015, Paris, France

Abstract:

Graph-based models form a fundamental aspect of data representation in data science and play a key role in modeling complex networked systems. In particular, there has recently been an ever-increasing interest in modeling dynamic complex networks, i.e. networks in which the topological structure (nodes and edges) may vary over time. In this context, we propose a novel model for representing finite discrete Time-Varying Graphs (TVGs), which are typically used to model dynamic complex networked systems. We analyze the data structures built from our proposed model and demonstrate that, for most practical cases, the asymptotic memory complexity of our model is in the order of the cardinality of the set of edges. Further, we show that our proposal is a unifying model that can represent several previous (classes of) models for dynamic networks found in the recent literature, which in general are unable to represent each other. In contrast to previous models, our proposal is also able to intrinsically model cyclic (i.e. periodic) behavior in dynamic networks. These representation capabilities attest to the expressive power of our proposed unifying model for TVGs. We thus believe our unifying model for TVGs is a step forward in the theoretical foundations for data analysis of complex networked systems.
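One way to picture an edge-proportional representation of a time-varying graph is to store each edge as a quadruple (u, t_a, v, t_b), meaning node u at time t_a connects to node v at time t_b. The abstract does not spell out the data structures, so the encoding, the class name, and its methods below are assumptions chosen for illustration. Memory grows with the number of stored edges, and an edge from a later time instant back to an earlier one can express cyclic behavior.

```python
from collections import defaultdict


class TimeVaryingGraph:
    """Hypothetical edge-list TVG: each edge is a (u, t_a, v, t_b) quadruple."""

    def __init__(self):
        self.edges = set()
        # Adjacency index keyed by (node, time); total size is proportional to the edge count.
        self._out = defaultdict(set)

    def add_edge(self, u, t_a, v, t_b):
        """Connect node u at time t_a to node v at time t_b."""
        self.edges.add((u, t_a, v, t_b))
        self._out[(u, t_a)].add((v, t_b))

    def successors(self, u, t):
        """Return the (node, time) pairs reachable in one step from node u at time t."""
        return set(self._out.get((u, t), ()))

    def snapshot(self, t):
        """Return the static edges (u, v) that start and end at time t."""
        return {(u, v) for (u, ta, v, tb) in self.edges if ta == t and tb == t}
```

In this encoding an edge with t_a equal to t_b is an ordinary edge within one time instant, while an edge with differing times links two instants, possibly of the same node.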