
Core Datasets

Core datasets within the DZL and its disease areas

A common understanding of which information is essential for the diagnosis, prognosis, and treatment of certain diseases allows scientists to evaluate their data collections and make them sustainable for future cross-site research. Such parameter lists also facilitate data integration in the DZL data warehouse. Since 2016, the DZL central data management has endeavoured to discover and connect meaningful datasets from all participating sites and associated partners across all eight disease areas. The parameter lists, from now on referred to as “core datasets”, allow us to ensure a minimum level of data completeness and to guarantee data quality to scientists who want to conduct research based on patient sets they have identified with the data warehouse query tool. For future data collections in studies and/or registries, these core datasets may serve as a starting point.

During the DZL annual meeting in 2020, the DZL Platform Biobanking & Data Management sent members to each disease area meeting to point out the importance of core datasets and to ask for corresponding proposals. For guidance, we later provided a definition clarifying the purpose of and requirements for such parameter lists. Two years later, and with the help of representatives of all eight disease areas, our efforts have resulted in ten distinct lists of parameters. Within the disease area ALI/ARDS, we decided to define an additional, separate dataset for pneumonia, resulting in a total of nine disease area-specific core datasets. Elements that occur in the lists of at least two disease areas were combined into the DZL core dataset.
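The rule for the DZL core dataset (elements that occur in the lists of at least two disease areas) can be illustrated with a minimal sketch. The area names, parameter names, and the function `shared_elements` are hypothetical and chosen for illustration only; they are not taken from the actual DZL lists, and the real harmonization process may involve manual mapping steps that are not shown here.

```python
from collections import Counter


def shared_elements(area_lists, min_areas=2):
    """Return the elements that occur in the lists of at least `min_areas` areas."""
    counts = Counter()
    for elements in area_lists.values():
        # Use a set so that an element is counted once per disease area,
        # even if it is listed twice within the same area.
        counts.update(set(elements))
    return sorted(name for name, n in counts.items() if n >= min_areas)


# Hypothetical parameter lists for three illustrative disease areas.
area_lists = {
    "area_a": ["age", "sex", "fev1", "smoking_status"],
    "area_b": ["age", "sex", "fev1", "exacerbations"],
    "area_c": ["age", "sex", "tumor_stage", "smoking_status"],
}

common = shared_elements(area_lists)
print(common)
```

In this sketch, parameters that appear in only one list (here `exacerbations` and `tumor_stage`) stay in their disease area-specific dataset, while the rest move into the shared list.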

Definition of a Core Dataset

A disease area-specific dataset definition is required to enable cross-dataset evaluations within the disease area as well as across disease areas and to ensure good data quality.

The dataset definition MUST contain all information required for a reliable diagnosis. These are parameters and criteria that are essential for correct phenotyping according to current guidelines.

The dataset definition SHOULD include all relevant information on symptom burden, prognosis, quality of life (e.g. EQ-5D), and inclusion criteria, as well as on the longitudinal course and extent of treatment.
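One way to read the MUST and SHOULD requirements is as a completeness check on individual records: a missing MUST parameter makes a record incomplete, while a missing SHOULD parameter is only reported. The following is a minimal sketch under that assumption. The function `check_record` and all parameter names are hypothetical and do not reflect the actual DZL core datasets or the data warehouse tooling.

```python
def check_record(record, must, should):
    """Check a record against hypothetical MUST and SHOULD parameter lists."""

    def missing(params):
        # A parameter counts as missing if it is absent or set to None.
        return [p for p in params if record.get(p) is None]

    missing_must = missing(must)
    return {
        "complete": not missing_must,  # only MUST parameters decide this
        "missing_must": missing_must,
        "missing_should": missing(should),
    }


# Hypothetical parameter lists and record, for illustration only.
must = ["diagnosis", "fev1"]
should = ["eq5d_score", "treatment_start"]
record = {"diagnosis": "example", "fev1": 2.9, "eq5d_score": None}

result = check_record(record, must, should)
print(result)
```

Keeping the two lists separate lets a site report gaps in SHOULD parameters without rejecting records that already satisfy the diagnostic minimum.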

More details on the dataset composition can be found in the publication “Definition, Composition, and Harmonization of Core Datasets Within the German Center for Lung Research” (
