Objective
To expose the reasoning pathways of a reinforcement learning policy for Medicaid care coordination, to develop an error taxonomy and to implement fairness-aware guardrails.
Design
Retrospective interpretability audit using attention analysis, Shapley explanations, sparse autoencoder feature discovery and blinded clinician adjudication.
Setting
Medicaid care coordination programmes in Washington, Virginia and Ohio (July 2023–June 2025).
Participants
250 000 intervention decisions; 200 divergent cases reviewed by five clinicians.
Main outcome measures
Calibrated harm prediction; algorithmic clearance and residual harm rates; error taxonomy frequencies; subgroup fairness metrics.
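As a minimal sketch of how the clearance and residual harm rates might be computed, the snippet below treats a decision as cleared when its predicted harm risk falls below a threshold, and reports the observed harm rate separately for cleared and flagged decisions. The function name, the threshold value and the toy data are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ClearanceMetrics:
    clearance_rate: float      # share of decisions cleared by the algorithm
    residual_harm_rate: float  # observed harm rate among cleared decisions
    flagged_harm_rate: float   # observed harm rate among flagged decisions


def clearance_metrics(
    risk: Sequence[float], harmed: Sequence[int], threshold: float
) -> ClearanceMetrics:
    """Split decisions at a risk threshold and summarise harm in each group.

    Assumption: a decision is cleared when its predicted risk is strictly
    below `threshold`; everything else is flagged for review.
    """
    if len(risk) != len(harmed) or len(risk) == 0:
        raise ValueError("risk and harmed must be non-empty and equal length")

    cleared = [h for r, h in zip(risk, harmed) if r < threshold]
    flagged = [h for r, h in zip(risk, harmed) if r >= threshold]

    def rate(outcomes: Sequence[int]) -> float:
        # An empty group has no defined harm rate.
        return sum(outcomes) / len(outcomes) if outcomes else math.nan

    return ClearanceMetrics(
        clearance_rate=len(cleared) / len(risk),
        residual_harm_rate=rate(cleared),
        flagged_harm_rate=rate(flagged),
    )


# Hypothetical toy data: eight decisions with predicted risk and observed harm.
toy_risk = [0.02, 0.05, 0.10, 0.40, 0.03, 0.70, 0.08, 0.01]
toy_harmed = [0, 0, 0, 1, 0, 1, 1, 0]
toy_metrics = clearance_metrics(toy_risk, toy_harmed, threshold=0.25)
```

In practice the threshold would come from the calibration procedure rather than being fixed by hand; the sketch only illustrates how the reported rates relate to one another.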
Results
The conformal harm prediction model achieved an area under the receiver operating characteristic curve of 0.80 (95% CI 0.78 to 0.82), clearing 89.5% (95% CI 88.9% to 90.1%) of decisions with 1.22% (95% CI 1.14% to 1.30%) residual harm versus 6.67% (95% CI 6.02% to 7.32%) for flagged decisions. Sparse autoencoders identified seven reasoning motifs linking social determinants to clinical cascades. The error taxonomy comprised premise errors (48%, 95% CI 41% to 55%), calibration failures (27%, 95% CI 21% to 33%) and contextual blind spots (25%, 95% CI 19% to 31%). Divergence was higher for telehealth visits (11.2%) and behavioural health patients (10.7% vs 6.9%, p<0.001). Fairness optimisation reduced race-group disparity by 37% (95% CI 22% to 48%) and sex-group disparity by 28% (95% CI 14% to 39%). Reviewers rated 23% (95% CI 17% to 29%) of overridden recommendations as well-matched, suggesting that most overrides were clinically justified.
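The intervals reported for the error taxonomy and for the reviewer ratings are numerically consistent with normal-approximation (Wald) intervals on the 200 clinician-reviewed cases. The sketch below reproduces them under that assumption. The abstract does not state which interval method was used, and the counts are back-calculated from the reported percentages, so both should be read as assumptions rather than as the study's procedure.

```python
import math


def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a binomial proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] so the interval stays a valid proportion range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Assumption: every percentage uses the 200 reviewed cases as its denominator.
N_REVIEWED = 200

# Counts back-calculated from the reported percentages (e.g. 48% of 200 = 96).
assumed_counts = {
    "premise errors": 96,
    "calibration failures": 54,
    "contextual blind spots": 50,
    "overrides rated well-matched": 46,
}

intervals = {
    label: wald_interval(count, N_REVIEWED)
    for label, count in assumed_counts.items()
}

for label, (lower, upper) in intervals.items():
    share = assumed_counts[label] / N_REVIEWED
    print(f"{label}: {share:.0%} (95% CI {lower:.0%} to {upper:.0%})")
```

The same formula applied to all 250 000 decisions would give far narrower intervals than those reported for the clearance and residual harm rates, so those intervals presumably rest on a different denominator or method that the abstract does not describe.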
Conclusions
Mechanistic interpretability transforms opaque algorithmic assistance into auditable decision support, providing a governance scaffold for clinical artificial intelligence deployment.