Objective
This study aimed to develop machine learning (ML) models to predict HIV status and to assess the factors associated with HIV infection among young men who have sex with men (MSM) under the Universal Health Coverage (UHC) programme in Thailand.
Methods
Young MSM aged 15–24 years who underwent HIV testing through the UHC programme from 2015 to 2022 were included. Data were divided into training (70%) and testing (30%) sets, with the Synthetic Minority Oversampling Technique (SMOTE) applied to address data set imbalance. ML models, including logistic regression, k-nearest neighbour (KNN), random forest, extreme gradient boosting (XGB) and AdaBoost, were used to predict HIV infection.
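The split-then-oversample step can be illustrated with a minimal, standard-library sketch. It is not the study's code: the two numeric features, the helper names `stratified_split` and `smote`, and the choice to oversample only the training set are assumptions made for illustration, and the `smote` function is a simplified version of the technique (random interpolation between a minority sample and one of its k nearest minority neighbours).

```python
import math
import random


def stratified_split(labels, test_frac=0.30, seed=0):
    """Return (train_idx, test_idx), preserving the class ratio in each part."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def smote(minority, n_new, k=5, seed=0):
    """Simplified SMOTE: build n_new synthetic points, each lying on the
    segment between a minority sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: 1000 records, 11% positive, two numeric features.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(890)]
X += [[data_rng.gauss(1, 1), data_rng.gauss(1, 1)] for _ in range(110)]
y = [0] * 890 + [1] * 110

# 70/30 split first, so the test set keeps the original class ratio.
train_idx, test_idx = stratified_split(y)
X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]

# Oversample the minority class in the training set until classes are equal.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
n_needed = y_train.count(0) - len(minority)
new_points = smote(minority, n_new=n_needed)

X_balanced = X_train + new_points
y_balanced = y_train + [1] * len(new_points)
```

In this sketch the test set is left untouched, so any metric computed on it reflects the original class distribution. Records with categorical predictors would need a variant of SMOTE that handles categories, which this numeric example does not cover.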
Results
Among 146 813 young MSM, 11% were diagnosed with HIV. Although KNN initially outperformed the other ML models, all models had low sensitivity on the original data set because of class imbalance. After applying SMOTE, the XGB model showed the best performance, with an accuracy of 0.72, sensitivity of 0.73, specificity of 0.72 and an area under the curve (AUC) of 0.72. The top predictors of HIV infection were the year of HIV testing (68%), age (55%) and targeted HIV testing (54%).
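The reported metrics can be made concrete with a short sketch. The helper names `summary` and `auc` and the 1000-case example are hypothetical. The example shows why accuracy alone is misleading at 11% prevalence: a classifier that labels every case negative is 89% accurate yet detects no infections.

```python
def summary(y_true, y_pred):
    """Accuracy, sensitivity and specificity from binary labels (1 = HIV positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }


def auc(y_true, scores):
    """Probability that a random positive case scores higher than a random
    negative case, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set of 1000 cases with 11% prevalence.
y_true = [1] * 110 + [0] * 890

# A degenerate model that predicts "negative" for everyone.
always_negative = [0] * 1000
baseline = summary(y_true, always_negative)
# baseline: accuracy 0.89, sensitivity 0.0, specificity 1.0

# Constant scores carry no ranking information, so the AUC is 0.5.
baseline_auc = auc(y_true, [0.0] * 1000)
```

Against this baseline, a model with sensitivity 0.73 and specificity 0.72 gives up some overall accuracy in exchange for identifying most positive cases.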
Discussion
This study demonstrates the potential of ML models, particularly XGB, in predicting HIV infection among young MSM in Thailand under the UHC programme. The application of SMOTE improved model sensitivity, addressing data imbalance and enhancing predictive accuracy.
Conclusions
ML models have the potential to enhance HIV risk assessment and inform targeted prevention strategies for high-risk populations.