Objectives
To evaluate the feasibility and performance of a large language model (LLM)-based artificial intelligence (AI) agent, implemented within a structured Claim–Argument–Evidence System (CAES), for supporting the review of clinical quality measure (CQM) evidence in the Centers for Medicare & Medicaid Services Consensus-Based Entity (CBE) endorsement process.
Methods
The CBE conducted a pilot study using a previously endorsed measure. CAES extracted claims and citations from a submitted diagnostic performance measure for pneumonia, automatically retrieved additional relevant evidence from PubMed abstracts, and assessed the quality, confidence and agreement of the evidence supporting each claim. The system's assessments were compared with the judgement of a subject matter expert (SME).
Results
CAES completed the assessment in approximately 5 hours. The SME agreed with the CAES-assigned claim statuses for 69% of claims, was neutral on 11% and disagreed with 14%. Disagreements primarily stemmed from the need for contextual interpretation beyond what the abstracts provided.
Discussion
Manual evaluation of CQM evidence requires significant time and resources, estimated at over 2400 labour hours per review cycle, limiting efficiency and transparency.
The AI agent evaluated 64 claims and 355 claim–evidence pairs related to the pneumonia diagnosis measure, assigning each claim a status based on the strength of the supporting evidence and generating a justification for each assignment.
Conclusion
This pilot demonstrated the feasibility and potential of LLM-based AI agents to improve the efficiency and transparency of evidence review for CQMs. Further development is needed to incorporate additional data sources and extend applicability across the measure development lifecycle.