Description


QEF is a Query Evaluation Framework that evolved from CoDIMS (Configurable Data Integration Middleware System). It is a middleware environment for the generation of adaptable and configurable data integration middleware systems. Data integration systems were designed to provide an integrated global view of data and programs published by heterogeneous and distributed data sources. Applications benefit from these types of system by transparently accessing resources independently of their localization, data model and original data structure.

The QEF framework helps users to define and execute several types of requests. By "request'', we mean a set of user defined workflows. The system manages the execution of the request in a distributed environment (Grid environment), the communication between query execution components, and the access to heterogeneous data sources.

Using QEF, applications can access transparently heterogeneous data stored at different locations. The idea is to provide a DataSource interface between the data and user's applications. This interface abstracts the format and the location of the data and eases the way the user accesses his data (files, databases, URLs, etc.), i.e. the user reads uniformly from DataSource wrappers without considering types or locations.

The extension of QEF to support SV techniques requires implementing an extension to the framework class DataSource to each external data to be processed. The extended class transforms the external data into an in-memory instance of a data-type. The mapping of the external data to an in-memory data-type instance may not be trivial and care must be taken when dealing with potential large data-sets. In this context, the scientist may decide to partition the data set into smaller processing units or to adopt pointers to the original data.

The framework provides an easy way for users to execute requests. In the QEF, a request is a set of user-defined algebraic operations, communicating with each other and aiming to produce a result. These operations are represented by a Query Execution Plan (QEP). Each operator consumes a tuple and produces a modified tuple. Hence, during the QEP execution tuples flow from one operator to another one in the tree; we call this dataflow the producer-consumer model.

A QEF is represented as a tree where the nodes are operators, the leaves are the data sources, and the edges are the relationship between operators in the producer-consumer form. We consider trees presenting linear or bushy topologies. Concretely, the QEP is an XML file composed of a list of operator templates, where each operator is defined by an id, a name, a list of producers and a list of parameters.

There are two types of operators in QEF: Algebraic and Control. Algebraic operators implement the algebra of a data model; they act on the content of a tuple, processing according to the desired semantics. Control operators, on the other hand, are Meta-operators that implement an execution characteristic, associated with the tuple dataflow. Combining those two types of operators, applications can support different execution workflows and allows the system to transparently decide on an execution strategy keeping intact the execution semantics.

One important aspect in QEF is the distribution of the execution over a Grid environment. Applications benefit from this type of mechanism by reducing query (request) evaluation time. In the QEF, a Query Optimizer decides which operators of the QEP should be parallelized and the set of available grid nodes to be used, based on the cost of the operator on each remote node. Figure 6 shows an operator parallelized over three nodes.

When an operator is parallelized over remote nodes, the management of the communication between the nodes is performed by the central node (Master node). A Query Optimizer runs on the central node and calls G2N to take the parallelization decision. The main idea behind this algorithm is that given a set of remote nodes and a set of tasks it defines a subset of nodes on which the task should be parallelized. G2N also manages the data transfer between the remote nodes and the central one, i.e. it decides how many tuples should be sent to each node for processing.

In order a distributed executing to happen, the initial QEP is transformed, by the introduction of new control operators. Figure \ref{paralelizacao} shows an initial QEP transformed to handle the distribution of an abstract operator B over two remote nodes. Control operators are represented in dark. In this case, the Optimizer adds the following control operators:

Receiver (R) and Sender (S): allows exchanging of data between nodes;
Instance to Block (I2B) and Block to Instance (B2I): aggregate and disaggregate tuples into blocks to optimize data transfer;
Split (Sp) e Merge (Me): to send and receive blocks from multiple nodes.

Team

LNCC Computer Science Department

Douglas Oliveira - ericson@lncc.br
Fábio Porto - fporto@lncc.br


Acknowledgment

This work was partially funded by CNPq, edict Research Productivity - 2009, process:309502/2009-8, and by PCI


Download

QEF Manual

QEF