COPD Machine Learning Datasets

This website describes a collection of feature datasets, derived from chest computed tomography (CT) images, which can be used in the diagnosis of chronic obstructive pulmonary disease (COPD).

The images in this database are weakly labeled: each image has a diagnosis (COPD or non-COPD), but it is not known which parts of the lungs are affected. Furthermore, the images were acquired at different sites and with different scanners. These two problems correspond to two learning scenarios in machine learning: multiple instance learning (or weakly supervised learning) and transfer learning (or domain adaptation), respectively.

These problems are receiving a lot of interest from the machine learning community, but are more recent in medical image analysis. However, they are very important in practice if machine learning methods are to be translated to the clinic. By publicly releasing these feature datasets, we hope that the problem of classifying COPD will receive more attention from both communities.


The database can be used free of charge for research and educational purposes. Redistribution and commercial use are not permitted. If you publish using data from this website (journal publications, conference papers, abstracts, technical reports, etc.), please cite the following paper:

Veronika Cheplygina, Isabel Pino Peña, Jesper Holst Pedersen, David A. Lynch, Lauge Sørensen and Marleen de Bruijne. Transfer learning for multi-center classification of chronic obstructive pulmonary disease. In Journal of Biomedical and Health Informatics, in press, 2017.

We kindly ask you to register your details before you download the data. This allows us to keep you updated about any changes to the data, related publications, or other news that you might want to be informed of.

If you use this data in a publication, we would appreciate it if you sent a reference to that publication to the following email address: v.cheplygina (at) tue (dot) nl. It will be added to the list of studies using data from this database at the bottom of this website.

Data description

The dataset contains derived features from CT images of patients and controls scanned at different centers, with different scanners and scanning parameters.

Each image is represented by 50 feature vectors, where each feature vector describes a volumetric region of interest (ROI) of 41 x 41 x 41 voxels, extracted at a random location inside the lung mask. Two different feature types are available:

_gss.txt Gaussian scale space features, or histograms of intensity values in the ROI after filtering the image. Here we use eight filters (smoothed image, gradient magnitude, Laplacian of Gaussian, three eigenvalues of the Hessian, Gaussian curvature and eigen magnitude), four scales (0.6, 1.2, 2.4 and 4.8 mm), and histograms of ten bins, giving 8 x 4 x 10 = 320 features in total. The file feature_names.txt specifies the ordering of the filters and the scales.

_kdei.txt Kernel density estimation features, or a histogram of intensity values between -1100 and -600 Hounsfield units in the ROI. There are 256 features in total.
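
As a rough illustration of where the two feature dimensions come from, the sketch below counts the Gaussian scale space features and computes a 256-point kernel density estimate of ROI intensities on a regular grid from -1100 to -600 Hounsfield units. The Gaussian kernel, the bandwidth of 10 HU, the normalization and the function name kde_features are assumptions made for illustration only; the feature extraction actually used for this database may differ, and the real feature ordering is the one given in feature_names.txt.

```python
import math

# Gaussian scale space: one ten-bin histogram per (filter, scale) pair.
N_FILTERS, N_SCALES, N_BINS = 8, 4, 10
N_GSS_FEATURES = N_FILTERS * N_SCALES * N_BINS


def kde_features(intensities, lo=-1100.0, hi=-600.0, n_points=256, bandwidth=10.0):
    """Gaussian kernel density estimate of ROI intensities on a regular HU grid.

    Illustrative only: kernel type and bandwidth are assumptions.
    """
    if not intensities:
        raise ValueError("the ROI must contain at least one intensity value")
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    norm = 1.0 / (len(intensities) * bandwidth * math.sqrt(2.0 * math.pi))
    features = []
    for g in grid:
        # Sum the kernel contributions of all voxels at this grid point.
        total = sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in intensities)
        features.append(norm * total)
    return features


# Toy ROI intensities in Hounsfield units (made-up values, not from the database).
toy_roi = [-950.0, -900.0, -870.0, -700.0]
toy_features = kde_features(toy_roi)
```

Here N_GSS_FEATURES matches the 320 Gaussian scale space features, and toy_features has 256 entries, one per grid point.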

Per data subset, two additional files are available:

_index.txt The ID of the subject that each ROI belongs to

_labels.txt The label of each subject, determined by the subject’s diagnosis: COPD (1) or non-COPD (0). COPD diagnosis is determined according to the Global Initiative for Chronic Obstructive Lung Disease (GOLD) criteria (FEV1/FVC < 0.7).
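
To use these files together, the ROI feature vectors have to be grouped per subject. The sketch below shows one possible way to do this. It assumes that each row of a feature file corresponds to one ROI and that _index.txt lists one subject ID per ROI in the same row order; these layout details, the helper names group_into_bags and gold_label, and the toy values are assumptions that should be checked against the downloaded files. The second helper only restates the GOLD threshold given above.

```python
from collections import defaultdict


def group_into_bags(features, subject_ids):
    """Group per-ROI feature vectors into one bag (list of vectors) per subject."""
    if len(features) != len(subject_ids):
        raise ValueError("need exactly one subject ID per ROI feature vector")
    bags = defaultdict(list)
    for vector, subject_id in zip(features, subject_ids):
        bags[subject_id].append(vector)
    return dict(bags)


def gold_label(fev1, fvc):
    """Return 1 (COPD) if FEV1/FVC < 0.7, otherwise 0 (non-COPD)."""
    return 1 if fev1 / fvc < 0.7 else 0


# Toy example with made-up values: two subjects, two ROIs each.
toy_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
toy_subject_ids = [101, 101, 102, 102]
toy_bags = group_into_bags(toy_features, toy_subject_ids)
```

Each bag can then be paired with its subject-level label, which is the setting of multiple instance learning: the label applies to the whole bag, not to individual ROIs.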


Publications using this database:

V. Cheplygina, I. Pino Peña, J. H. Pedersen, D. A. Lynch, L. Sørensen and M. de Bruijne. Transfer learning for multi-center classification of chronic obstructive pulmonary disease. In Journal of Biomedical and Health Informatics, in press, 2017.

Publications using similar features:

V. Cheplygina, L. Sørensen, D.M.J. Tax, M. de Bruijne, and M. Loog, Label Stability in Multiple Instance Learning, Medical Image Computing and Computer Assisted Intervention (MICCAI), 2015

V. Cheplygina, L. Sørensen, D.M.J. Tax, J.H. Pedersen, M. Loog, and M. de Bruijne, Classification of COPD with Multiple Instance Learning, International Conference on Pattern Recognition (ICPR), 2014.

L. Sørensen, M. Nielsen, P. Lo, H. Ashraf, J.H. Pedersen, and M. de Bruijne, Texture-Based Analysis of COPD: a Data-Driven Approach, IEEE Transactions on Medical Imaging 31(1): 70-78, 2012.

J.H. Pedersen, H. Ashraf, A. Dirksen, K. Bach, H. Hansen, P. Toennesen, H. Thorsen, J. Brodersen, B.G. Skov, M. Døssing, and J. Mortensen, The Danish randomized lung cancer CT screening trial: overall design and results of the prevalence round, Journal of Thoracic Oncology 4(5): 608-614, 2009.


We thank Jesper Holst Pedersen (Department of Thoracic Surgery, Rigshospitalet, University of Copenhagen), Morten Vuust (Department of Diagnostic Imaging, Vendsyssel Hospital, Frederikshavn) and Ulla Møller Weinreich (Department of Pulmonology Medicine and Clinical Institute at Aalborg University Hospital) for their assistance in the acquisition of the CT images. We also thank the Danish Council for Independent Research.


If you have any questions or comments, please contact Veronika Cheplygina (v.cheplygina (at) tue (dot) nl).