Research Data Colloquium: Data Quality Framework in the DIFUTURE Consortium

Date: 21 July 2023
Time: 14:00 – 16:00
Location: Seminar room of Informatik 6 (Martensstraße 3, Room 08.130)

Clinical and translational data warehouses are important infrastructure building blocks for modern data-driven approaches in medical research. These analytics-oriented databases have been designed to integrate heterogeneous biomedical datasets from different sources and to support use cases such as cohort selection and ad-hoc data analyses. However, the lack of clear definitions of source data and controlled data collection procedures often raises concerns about the quality of data provided in such environments and, consequently, about the evidence level of related findings. To address these problems, we present an architecture that helps to monitor data quality issues when importing data into warehousing solutions using ETL (Extraction, Transformation, Load) processes. Our approach provides software developers with an API (Application Programming Interface) for logging detailed and structured information about data quality issues encountered. This information can then be displayed in dynamic dashboards, the evolution of data quality can be monitored over time, and quality issues can be traced back to their source. Our architecture supports several well-known data quality dimensions, addressing conformance, completeness, and plausibility. We present an open-source implementation, which is compatible with common clinical and translational data warehousing platforms, such as i2b2 and tranSMART, and which can be used in conjunction with many ETL environments.
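
To make the idea of a structured logging API for data quality issues more concrete, the following is a minimal sketch in Python. It is not the implementation presented in the talk: all names (Dimension, QualityIssue, QualityLog, check_patient), the record fields, the allowed sex codes, and the age range are hypothetical and chosen only for illustration. It shows how an ETL step might record issues along the three dimensions named above (conformance, completeness, plausibility) together with the source they came from, so that they can later be counted for a dashboard or traced back.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Dimension(Enum):
    """The three data quality dimensions named in the abstract."""
    CONFORMANCE = "conformance"
    COMPLETENESS = "completeness"
    PLAUSIBILITY = "plausibility"


@dataclass(frozen=True)
class QualityIssue:
    """One structured log entry (hypothetical schema)."""
    dimension: Dimension
    source: str       # originating system or file, for tracing back
    record_id: str    # identifier of the affected record
    attribute: str    # affected field
    message: str
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class QualityLog:
    """In-memory collector; a real system would persist the entries."""

    def __init__(self):
        self.issues = []

    def log(self, dimension, source, record_id, attribute, message):
        issue = QualityIssue(dimension, source, record_id, attribute, message)
        self.issues.append(issue)
        return issue

    def counts_by_dimension(self):
        # Aggregate view, e.g. as input for a dashboard panel.
        return Counter(issue.dimension for issue in self.issues)

    def issues_from(self, source):
        # Trace issues back to the source they were loaded from.
        return [issue for issue in self.issues if issue.source == source]


# Assumed code list and age range, purely illustrative.
ALLOWED_SEX_CODES = {"female", "male", "other", "unknown"}
MAX_PLAUSIBLE_AGE = 130


def check_patient(row, source, log):
    """Hypothetical checks run during the transformation step of an ETL job."""
    record_id = str(row.get("id", "?"))
    if not row.get("birth_date"):
        log.log(Dimension.COMPLETENESS, source, record_id,
                "birth_date", "missing value")
    if row.get("sex") not in ALLOWED_SEX_CODES:
        log.log(Dimension.CONFORMANCE, source, record_id,
                "sex", f"unexpected code {row.get('sex')!r}")
    age = row.get("age")
    if age is not None and not 0 <= age <= MAX_PLAUSIBLE_AGE:
        log.log(Dimension.PLAUSIBILITY, source, record_id,
                "age", f"age {age} outside 0..{MAX_PLAUSIBLE_AGE}")
```

In this sketch, an ETL job would create one QualityLog per import run, call check_patient for each row, and then hand counts_by_dimension() to whatever reporting layer is in use. Storing one log per run is one simple way to compare data quality across imports over time.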

