GYPSCIE - Model Management Project

The increasing availability of digital data produced by all kinds of human activity is driving the development of a new society [Har16]. In the sciences, this phenomenon has become known as the Fourth Paradigm [HTT09], in which in-vivo experiments give way to in-silico ones, that is, to analyses performed on data obtained experimentally or through simulation. In industry, microsensors scattered throughout the physical environment, whether in an industrial plant or in oil wells, measure many variables at high frequency, enabling monitoring of production processes and decision making about them. In this context, advanced data-driven decision-support models are emerging. Techniques generally known as Machine Learning [SB17] learn about the world through learning tasks based on samples of its states, as provided by data. Different algorithms have been proposed following this approach, such as decision trees and deep neural networks, in addition to statistical and time-series models. Broadly, the techniques fall into two large classes, depending on the type of learning performed: classification (when the value to be predicted is a label) and regression (when the value to be predicted is a continuous quantity).

Let us therefore adopt the following general definition of the learning problem, following [SB17]: a learning algorithm receives as input a set S, sampled from an unknown distribution D and labeled by a target function f. The algorithm must produce a predictor h_S : X → Y (the subscript S emphasizes that the predictor depends on the training set S). The objective of the algorithm is to find an h_S that minimizes the error with respect to D and f, both of which are unknown. Defining the error is therefore an important part of the problem. Once some error is admitted, infinitely many functions h_S are candidate predictors. In such cases, constraints are imposed on the learning process so that it converges to an h_S that minimizes the error.
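The definition above can be made concrete with a small sketch. Everything specific in it is an assumption chosen for illustration and does not come from [SB17] or from this project: D is taken to be uniform on [0, 1], f is a step function at 0.5, and the constraint on the learning process is that h_S must be a threshold predictor. The learner picks the threshold with the lowest error on S, and the error with respect to D and f is then estimated on a fresh sample.

```python
import random

def f(x):
    # Hypothetical target function f: label 1 when x exceeds 0.5.
    return 1 if x > 0.5 else 0

def sample(n, rng):
    # Draw n points from a hypothetical distribution D (uniform on [0, 1])
    # and label each one with f.
    return [(x, f(x)) for x in (rng.random() for _ in range(n))]

def error(h, data):
    # Fraction of examples on which predictor h disagrees with the label.
    return sum(h(x) != y for x, y in data) / len(data)

def learn_threshold(S):
    # Constraint on the learning process: only predictors of the form
    # h_t(x) = 1 if x > t else 0 are considered. Candidate thresholds are
    # 0.0 and the sample points; the one with the lowest error on S wins.
    candidates = [0.0] + sorted(x for x, _ in S)

    def make(t):
        return lambda x: 1 if x > t else 0

    best_t = min(candidates, key=lambda t: error(make(t), S))
    return make(best_t)

rng = random.Random(0)
S = sample(50, rng)                  # training set S
h_S = learn_threshold(S)             # predictor that depends on S
train_error = error(h_S, S)          # error measured on S itself
test_error = error(h_S, sample(10_000, rng))  # estimate of error w.r.t. D and f

print(f"training error: {train_error:.3f}")
print(f"estimated true error: {test_error:.3f}")
```

Since D and f are unknown in practice, only the training error is directly observable; the second figure is available here only because the sketch defines D and f itself.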

In this project, however, we relax this last observation and consider that there exists a set H = {h_1, h_2, ..., h_n} of predictors for a given pair (D, f), as defined above. This premise is justified by the fact that several competing groups independently build models of the same domain, as happens among different groups at Petrobras. Hence there are multiple predictive models of the same phenomenon (D, f). Because these models are built with different learning algorithms, different samples of the domain, and different validation data, they can be expected to differ in quality and in their potential applicability.

This premise opens up new challenges, which in this project we call model management (GM). In GM, the goal is to extract the best from the collection of available models, so that a model is chosen not according to a single evaluation bias but from comparative perspectives that are more specific and appropriate to its application.
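To illustrate why a single evaluation criterion may not be enough, the sketch below ranks a small set H of predictors for the same phenomenon under two different error measures. The predictors, the validation data, and the function names are all hypothetical and chosen so that the arithmetic is easy to follow; they are not part of the project.

```python
def mean_abs_error(h, data):
    # Average absolute deviation: rewards models that are good on average.
    return sum(abs(h(x) - y) for x, y in data) / len(data)

def max_abs_error(h, data):
    # Worst-case deviation: rewards models with no large individual error.
    return max(abs(h(x) - y) for x, y in data)

# Hypothetical validation data for the target f(x) = x.
data = [(0, 0), (1, 1), (2, 2), (3, 3)]

# Hypothetical set H of predictors for the same pair (D, f).
H = {
    "h1_biased": lambda x: x + 0.5,                  # off by 0.5 everywhere
    "h2_local": lambda x: x if x < 3 else x + 1.2,   # exact except at x = 3
    "h3_constant": lambda x: 1.5,                    # ignores the input
}

def rank(H, data, metric):
    # Order predictor names from lowest to highest error under `metric`.
    return sorted(H, key=lambda name: metric(H[name], data))

by_mean = rank(H, data, mean_abs_error)
by_max = rank(H, data, max_abs_error)

print("ranked by mean absolute error:", by_mean)
print("ranked by maximum absolute error:", by_max)
```

Under the mean absolute error, h2_local comes first because it is exact on three of the four points; under the maximum absolute error, h1_biased comes first because it never misses by more than 0.5. Which model is "best" therefore depends on the criterion that matters for the application.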

We can then list some desirable features of a system that provides model management:

- Identification and characterization of a learning model;
- Search for models;
- Comparison between models;
- Reuse of models;
- Model data management;
- Interoperability between models.
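As a rough sketch of how the first three features might look in code, the example below keeps a descriptive record for each model (identification and characterization), filters records by attribute (search), and orders them by a stored metric (comparison). The names `ModelRecord` and `ModelRegistry`, the record fields, and the sample entries are hypothetical and do not describe the project's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRecord:
    # Hypothetical metadata identifying and characterizing one model.
    name: str
    task: str            # "classification" or "regression"
    algorithm: str       # e.g. "decision tree"
    training_data: str   # identifier of the sample S used for training
    metrics: dict = field(default_factory=dict)  # metric name -> error value

class ModelRegistry:
    def __init__(self):
        self._records = {}

    def register(self, record):
        # Identification: each model is stored under a unique name.
        if record.name in self._records:
            raise ValueError(f"model already registered: {record.name}")
        self._records[record.name] = record

    def search(self, **criteria):
        # Search: return records whose attributes equal all given values.
        return [r for r in self._records.values()
                if all(getattr(r, k) == v for k, v in criteria.items())]

    def compare(self, metric):
        # Comparison: order models by one metric, assuming lower is better;
        # models that lack the metric are left out.
        scored = [r for r in self._records.values() if metric in r.metrics]
        return sorted(scored, key=lambda r: r.metrics[metric])

registry = ModelRegistry()
registry.register(ModelRecord("m1", "regression", "decision tree",
                              "sample-A", {"mae": 0.42, "max_error": 1.9}))
registry.register(ModelRecord("m2", "regression", "neural network",
                              "sample-B", {"mae": 0.55, "max_error": 1.1}))
registry.register(ModelRecord("m3", "classification", "decision tree",
                              "sample-A", {"error_rate": 0.08}))

regression_models = [r.name for r in registry.search(task="regression")]
best_by_mae = registry.compare("mae")[0].name
best_by_max = registry.compare("max_error")[0].name
```

Reuse, model data management, and interoperability are not covered by this sketch; they would need, at minimum, a way to store and load the models themselves and a shared format for exchanging them.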