Data Science Platform applied to Health

Site: https://bigdata.icict.fiocruz.br

Data Science is a field of study notable for its ability to support the discovery of useful information in large or complex databases, as well as data-driven decision making. It can be defined as a set of strategies, tools and techniques for collecting, transforming and analyzing data, carried out by multidisciplinary teams of researchers with substantive knowledge of the problem under analysis (in our case, public health), statisticians, mathematicians and computer scientists (data-driven analysis). It combines traditional analysis methods with sophisticated algorithms to process large volumes of data in various formats: structured, semi-structured and unstructured.
The analysis process in Data Science involves four phases: (i) collection and ingestion: extraction, transformation and load (better known as ETL); (ii) pre-processing: selection of records, dimensionality reduction, normalization and creation of data subsets; (iii) exploratory analysis and data mining: mainly analyses aimed at classification, association, clustering, anomaly detection and prediction; (iv) post-processing: pattern interpretation, filtering, visualization and integration into decision support systems and online visualization platforms.
The term "Big Data" has been drawing attention well beyond the academic research groups at the frontier of knowledge in computer science, particle physics, genetics and astronomy. Big Data certainly implies a very large volume of data, but volume is not the only characteristic that makes up the concept. Besides volume (which should be "Big"), one of its main features is the variety of data to be processed: structured, semi-structured and unstructured data (comments on social networks, blogs, websites, Google searches, etc.).
Another factor that characterizes Big Data is the speed required to process large and diverse stored databases, including the possibility of real-time processing. In both cases the innovation lies in the adoption of distributed processing.
Within the health sector, it is not difficult to imagine the possibilities of the Data Science approach for the analysis, monitoring and prediction of events (cases) and of health and disease situations in the population, as well as their association with social determinants. The health sector already produces a huge amount of data about people who access the SUS (Sistema Único de Saúde, Brazil's Unified Health System), but it is also important to have information about those who have not yet accessed it. This is only possible by integrating external data sources, such as social networks, blogs and digital media, and by processing them in real time. Adopting these tools and strategies will force us to change the way we collect, store, manage, analyze and visualize health data and other data of interest to health.
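As an illustration only, the four analysis phases described above can be sketched in a few lines of Python using the standard library. This is a minimal sketch, not the platform's implementation: the weekly case counts are made-up toy data, the function names (ingest, preprocess, find_anomalies, report) are hypothetical, and a simple z-score rule stands in for the anomaly detection step.

```python
import csv
import io
import statistics

# Made-up weekly case counts for a fictional municipality (illustrative only).
RAW_CSV = """week,cases
1,12
2,15
3,
4,14
5,13
6,60
7,16
8,11
"""


def ingest(text):
    """(i) Collection and ingestion: extract rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def preprocess(rows):
    """(ii) Pre-processing: drop incomplete records and convert types."""
    return [(int(r["week"]), int(r["cases"])) for r in rows if r["cases"]]


def find_anomalies(records, threshold=2.0):
    """(iii) Analysis: flag weeks whose count is more than `threshold`
    population standard deviations away from the mean (a z-score rule,
    chosen here only as a simple stand-in for anomaly detection)."""
    counts = [cases for _, cases in records]
    mean = statistics.mean(counts)
    stdev = statistics.pstdev(counts)
    if stdev == 0:
        return []
    return [(week, cases) for week, cases in records
            if abs(cases - mean) / stdev > threshold]


def report(anomalies):
    """(iv) Post-processing: turn flagged records into readable messages."""
    return [f"week {week}: {cases} cases flagged for review"
            for week, cases in anomalies]


if __name__ == "__main__":
    rows = ingest(RAW_CSV)
    records = preprocess(rows)
    for line in report(find_anomalies(records)):
        print(line)
```

In a real setting each phase would typically run on a distributed processing framework and read from health information systems rather than an in-memory string; the sketch only shows how the phases hand data to one another.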

Members

Marcel de Moraes Pedroso

Other Institutions

Celso Suckow da Fonseca Federal Center for Technological Education

Leopoldo Américo Miguez de Mello Research, Development and Innovation Center

Institut National de Recherche en Informatique et en Automatique