PURPOSE: This study investigates whether a diagnostic AI model can effectively support lesion detection and staging in [¹⁸F]FDG PET/CT studies of non-small cell lung cancer (NSCLC), focusing on the distinction between technical segmentation accuracy and clinically meaningful performance.
METHODS: In this retrospective single-centre study, [¹⁸F]FDG PET/CT scans from 306 treatment-naïve NSCLC patients were reviewed with reference to multidisciplinary team decisions. Tumour lesions were manually segmented to serve as the reference standard and compared with predictions from the top-performing algorithm of the autoPET III challenge. Quantitative segmentation metrics were calculated, and lesion-level errors were assessed for their impact on patient-level TNM and UICC staging.
RESULTS: The algorithm achieved a mean Dice Similarity Coefficient (DSC) of 0.64. Lesion-level sensitivity was 95.8% across all patients, with a precision of 87.5%. False-positive M-category lesions (n = 196) were the most frequent error. Of all false positives, 35.7% were benign and 34.7% were non-oncologic pathologies. UICC staging matched the ground truth in 207/306 patients, with most discordances due to upstaging (88/306).
CONCLUSION: Clinically driven metrics and cause-based error analysis offer valuable insight into AI segmentation performance. The evaluated model showed excellent lesion sensitivity but a tendency towards systematic overprediction across TNM categories. At the lesion level, M-category false positives and undersegmentation in the hilar region emerged as the main drivers of clinically relevant upstaging. Despite promising lesion detection sensitivity, only 67.7% of UICC stagings were accurate when using AI masks, indicating that diagnostic AI may support, though not yet replace, manual lesion evaluation in NSCLC [¹⁸F]FDG PET/CT.
Keywords
