A Multi-centric Evaluation of Deep Learning Models for Segmentation of COVID-19 Lung Lesions on Chest CT Scans

Background: Chest computed tomography (CT) is one of the most common tools used for diagnosing patients with coronavirus disease 2019 (COVID-19). While segmentation of COVID-19 lung lesions by radiologists can be time-consuming, the application of advanced deep learning techniques for automated segmentation can be a promising step toward the management of this infection and similar diseases in the future. Objectives: This study aimed to evaluate the performance and generalizability of deep learning-based models for the automated segmentation of COVID-19 lung lesions. Patients and Methods: Four datasets (2 private and 2 public) were used in this study. The first private dataset included 297 subjects (147 healthy and 150 COVID-19 cases), and the second private dataset included 82 COVID-19 subjects. The public datasets were COVID19-P20 (20 COVID-19 cases from 2 centers) and MosMedData (50 COVID-19 patients from a single center). Models were compared based on the Dice similarity coefficient (DSC), receiver operating characteristic (ROC) curve, and area under the curve (AUC). The CT severity scores predicted by the model were compared with those of radiologists using Pearson’s correlation coefficient (PCC). DSC was also used to compare the agreement between the model and an expert against the agreement between 2 experts on an unseen dataset. Finally, the generalizability of the model was evaluated, and a simple calibration strategy was proposed. Results: The VGG16-UNet model showed the best performance on both private datasets, with a DSC of 84.23% ± 1.73% on the first private dataset and 56.61% ± 1.48% on the second private dataset. Similar results were obtained on public datasets, with a DSC of 60.10% ± 2.34% on the COVID19-P20 dataset and 66.28% ± 2.80% on a combined dataset of COVID19-P20 and MosMedData.
The PCCs between the CT severity scores predicted by the model and those assigned by radiologists were 0.89 and 0.85 on the first private dataset and 0.77 and 0.74 on the second private dataset for the right and left lungs, respectively. Moreover, the model trained on the first private dataset was evaluated on the second private dataset and compared against the radiologist, which revealed a performance gap of 5.74% based on DSCs. A calibration strategy was employed to reduce this gap to 0.53%. Conclusion: The results demonstrated the potential of the proposed model in localizing COVID-19 lesions on CT scans across multiple datasets; its accuracy was comparable to that of radiologists, and it could assist them in diagnostic and treatment procedures. Model calibration was also shown to improve performance on an unseen dataset, increasing the DSC by more than 5%.
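
For readers unfamiliar with the two agreement measures named above, the following is a minimal sketch of how a DSC and a PCC can be computed. It is an illustration under stated assumptions, not the study's implementation: the function names `dice_coefficient` and `pearson_correlation` are hypothetical, masks are assumed to be flattened binary sequences, the convention that two empty masks score 1.0 is an assumption, and the toy values are made up and unrelated to the reported results.

```python
from math import sqrt


def dice_coefficient(pred, truth):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks are treated as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy data (illustrative only): a predicted mask and a reference mask.
predicted_mask = [1, 1, 0, 0, 1, 0]
reference_mask = [1, 0, 0, 0, 1, 1]
toy_dsc = dice_coefficient(predicted_mask, reference_mask)

# Toy severity scores (illustrative only): model versus a reader.
model_scores = [1, 2, 3, 4, 5]
reader_scores = [1, 2, 4, 4, 5]
toy_pcc = pearson_correlation(model_scores, reader_scores)
```

In this sketch the DSC is a fraction between 0 and 1; multiplying by 100 gives the percentage form used in the abstract.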