How to tackle the challenges of Data Integration

PRECISE4Q sets out to create multi-dimensional data-driven predictive simulation models for stroke. These will address patients’ needs at different stages: prevention, acute treatment, rehabilitation and reintegration. The overarching goal is to enable personalised stroke treatment and contribute to minimising the burden of stroke on the individual and society.

To achieve all this, the project will first centrally collect, harmonise and integrate heterogeneous data from multidisciplinary sources such as electronic health records, national health registries, biobanks and health insurance data. The resulting collection of data will span a wide range of formats, from genomics and microbiomics data to lab data and imaging data, as well as social data such as lifestyle, gender, and economic and occupational factors.
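As a rough illustration of what such integration involves, the sketch below shows how records from two hypothetical sources with different field names and units could be mapped into one common record structure. All field names, source formats and values here are assumptions made for the example, not project specifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HarmonisedRecord:
    """One common target structure for values coming from different sources."""
    patient_id: str
    source: str             # e.g. "ehr", "registry", "insurance"
    variable: str           # harmonised variable name
    value: Optional[float]
    unit: Optional[str]


def from_ehr(row: dict) -> HarmonisedRecord:
    # Hypothetical EHR export with its own key names
    return HarmonisedRecord(row["pid"], "ehr", row["lab_code"],
                            float(row["result"]), row.get("unit"))


def from_registry(row: dict) -> HarmonisedRecord:
    # Hypothetical registry export using different keys for the same concepts
    return HarmonisedRecord(row["patient"], "registry", row["variable"],
                            float(row["val"]), None)


records = [
    from_ehr({"pid": "P001", "lab_code": "ldl_cholesterol", "result": "131", "unit": "mg/dL"}),
    from_registry({"patient": "P001", "variable": "ldl_cholesterol", "val": "3.4"}),
]
for r in records:
    print(r)
```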

The project thus sets out to build a complex data infrastructure. Because of the varied provenance of the data, not all of it will be fully characterised from the outset, and many specifications may be incomplete. To address these challenges, an agile approach to data management has been adopted: a data lake. Working iteratively, substantial effort will be channelled into information extraction, semantic labelling and standardisation of the differently sourced data as it becomes available in the lake.
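A minimal sketch of this iterative data-lake pattern follows: raw items are landed as-is and enriched with semantic labels in later passes, once specifications become clearer. The directory layout, file format and function names are illustrative assumptions, not the PRECISE4Q implementation.

```python
import json
import pathlib

LAKE = pathlib.Path("data_lake")
(LAKE / "raw").mkdir(parents=True, exist_ok=True)
(LAKE / "curated").mkdir(parents=True, exist_ok=True)


def ingest(item_id: str, payload: dict, source: str) -> None:
    """First pass: land the data untouched, with minimal provenance metadata."""
    (LAKE / "raw" / f"{item_id}.json").write_text(
        json.dumps({"source": source, "payload": payload})
    )


def enrich(item_id: str, labels: dict) -> None:
    """Later pass: attach semantic labels to an item already in the lake."""
    item = json.loads((LAKE / "raw" / f"{item_id}.json").read_text())
    item["labels"] = labels
    (LAKE / "curated" / f"{item_id}.json").write_text(json.dumps(item))


ingest("rec-001", {"text": "Pt admitted with acute ischaemic stroke"}, source="ehr")
enrich("rec-001", {"diagnosis": "ischaemic_stroke"})
```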

The methods employed will combine the multiple sources and representations of data into a form where items of data share meaning, a process known as semantic harmonisation. Data will be annotated with metadata and mapped onto the PRECISE4Q ontology. The project will construct a thesaurus for the languages of the data sources used and, from it, a common ontology-based data model. Natural language processing will be used to structure clinical texts. This process will also help establish the requirements for data integration and modelling.
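To make the idea of thesaurus-driven harmonisation concrete, here is a deliberately simplified sketch in which source-specific terms are mapped to a shared concept identifier. The "P4Q:" concept IDs and the synonym lists are invented for illustration; the actual PRECISE4Q ontology and thesaurus are far richer.

```python
from typing import Optional

# Hypothetical thesaurus: common concept id -> terms as they appear in the sources
THESAURUS = {
    "P4Q:HYPERTENSION": {"htn", "hypertension", "bluthochdruck", "i10"},
    "P4Q:ATRIAL_FIB": {"afib", "atrial fibrillation", "vorhofflimmern", "i48"},
}


def harmonise(term: str) -> Optional[str]:
    """Return the common ontology concept for a source-specific term, if known."""
    t = term.strip().lower()
    for concept, synonyms in THESAURUS.items():
        if t in synonyms:
            return concept
    return None  # unmapped terms would be queued for manual curation


print(harmonise("Vorhofflimmern"))  # -> P4Q:ATRIAL_FIB
print(harmonise("I10"))             # -> P4Q:HYPERTENSION
```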

Furthermore, all changes to the data will be recorded. An overview of provenance and data ownership will be maintained at all times, and it will be possible to filter data characterisations by source. Successive iterations of the enriched data will then be used for machine learning training and predictive modelling.
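One simple way to picture such provenance tracking is an append-only log in which every change to a data item carries its source and owner, so entries can later be filtered by source. This is only a sketch under assumed field names, not the project's actual provenance model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ProvenanceEntry:
    item_id: str
    action: str        # e.g. "ingested", "annotated", "standardised"
    source: str        # originating data source
    owner: str         # responsible data owner / institution
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


log: List[ProvenanceEntry] = []
log.append(ProvenanceEntry("rec-001", "ingested", source="registry", owner="Hospital A"))
log.append(ProvenanceEntry("rec-001", "annotated", source="registry", owner="Hospital A"))

# Filter data characterisations by source
registry_entries = [e for e in log if e.source == "registry"]
print(len(registry_entries))
```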

Due to differing data privacy constraints, however, data storage will have to be separated for preprocessing (annotating and training) and for clinical application. As a result, the data integration task involves different data transmission pipelines. For clinical purposes, the “prediction knowledge repository” will be linked to a subset of the clinical data in order to map the predictive modelling onto a concrete, specific clinical case.
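A minimal sketch of such a privacy-motivated split is given below: records are routed to separate stores depending on whether they serve preprocessing or clinical use, with identifiers stripped before entering the training store. The store names, the pseudonymisation step and the routing function are assumptions made for illustration.

```python
def pseudonymise(record: dict) -> dict:
    """Strip direct identifiers before a record enters the preprocessing store."""
    return {k: v for k, v in record.items() if k not in {"name", "address"}}


training_store, clinical_store = [], []


def route(record: dict, purpose: str) -> None:
    if purpose == "preprocessing":       # annotation / model-training pipeline
        training_store.append(pseudonymise(record))
    elif purpose == "clinical":          # linked to the prediction knowledge repository
        clinical_store.append(record)    # full detail kept for the concrete clinical case


patient = {"name": "Jane Doe", "address": "Example Street 1", "nihss": 7}
route(patient, purpose="preprocessing")
route(patient, purpose="clinical")
print(training_store[0])  # identifiers removed
```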