Objectives
To evaluate the ability of large language models (LLMs) to simulate multidisciplinary team (MDT) decision-making in colorectal cancer, a malignancy that often requires complex treatment planning.
Methods
We retrospectively analysed 1423 colorectal cancer cases discussed at MDT meetings at Peking University Cancer Hospital between January 2023 and December 2024. Three LLMs—OpenAI o3-mini-2025-01-31, DeepSeek-R1 671b and Qwen qwq-plus-2025-03-05—were tested for their ability to replicate MDT recommendations using a standardised treatment categorisation framework. Each case was processed three times per model; only cases with consistent outputs across all three runs were included. Concordance between AI-generated decisions and expert MDT consensus was assessed using agreement percentages and Cohen’s kappa.
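The concordance scoring described above can be sketched in code. Below is a minimal stdlib implementation of Cohen's kappa for two categorical raters (here, an LLM and the MDT consensus); the function name and toy treatment labels are illustrative, not the study's actual categorisation framework.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the two raters labelled independently.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical example: LLM recommendations vs. MDT consensus labels.
llm = ["surgery", "chemotherapy", "surgery", "chemotherapy"]
mdt = ["surgery", "chemotherapy", "chemotherapy", "chemotherapy"]
print(cohens_kappa(llm, mdt))  # 0.5
```

In practice a library routine (e.g. `sklearn.metrics.cohen_kappa_score`) would typically be used; the sketch only makes the agreement-versus-chance correction explicit.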
Results
o3-mini demonstrated the highest intra-model stability, with an agreement rate of 81.0% (Fleiss’ kappa = 0.794), yielding 1153 cases with consistent outputs. Concordance with MDT consensus was comparable across the three models, ranging from 62.5% to 65.4%. Multivariable analysis of o3-mini outputs identified treatment-naïve status, non-metastatic disease and colon tumour location as independent predictors of higher concordance with experts.
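The intra-model stability statistic reported above can be sketched as follows: a minimal stdlib implementation of Fleiss' kappa, treating each model's three repeated runs on a case as three "raters". The function name and toy data are illustrative assumptions, not the study's code or data.

```python
from collections import Counter

def fleiss_kappa(runs_per_case):
    """Fleiss' kappa across repeated runs.

    runs_per_case: list of lists, one inner list of categorical
    outputs (e.g. three runs of the same model) per case.
    """
    n_cases = len(runs_per_case)
    n_raters = len(runs_per_case[0])
    totals = Counter()   # category counts pooled over all cases/runs
    p_bar = 0.0          # mean per-case agreement P_i
    for runs in runs_per_case:
        counts = Counter(runs)
        totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_cases
    # Expected agreement from overall category proportions p_j.
    p_e = sum((t / (n_cases * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Two hypothetical cases, each answered identically across three runs:
print(fleiss_kappa([["A", "A", "A"], ["B", "B", "B"]]))  # 1.0
```

Cases whose three runs did not all agree would then be excluded before the concordance analysis, mirroring the consistency filter described in the Methods.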
Discussion
LLMs showed fair overall agreement with expert MDT decisions, with stronger performance in standardised and less complex clinical scenarios. Areas of higher concordance included treatment-naïve non-metastatic colon cancer, treated non-metastatic rectal cancer and treated non-metastatic colon cancer.
Conclusion
LLMs can partially replicate expert MDT recommendations in colorectal cancer. Their integration into clinical workflows should aim to complement, rather than replace, human expertise.