Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions

Objective

This study aimed to evaluate leading large language models (LLMs) on a validated medical knowledge test in Portuguese.

Methods

This study compared 31 LLMs on the Brazilian national medical licensing examination (Revalida), evaluating the performance of 23 open-source and 8 proprietary models on 399 multiple-choice questions.
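For illustration, accuracy on a multiple-choice benchmark of this kind can be computed with a simple scoring loop. The sketch below is not the authors' pipeline: the question file layout, the column names, the Portuguese prompt wording, and the query_model() helper are all assumptions introduced here for clarity.

import csv

def query_model(prompt: str) -> str:
    """Placeholder for a call to an LLM (API or local model).

    Assumed to return a single option letter such as 'A'."""
    raise NotImplementedError

def evaluate(questions_path: str) -> float:
    correct = 0
    total = 0
    with open(questions_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Each row is assumed to hold the question stem, the options
            # A-D, and the official answer key letter.
            prompt = (
                f"{row['stem']}\n"
                f"A) {row['A']}\nB) {row['B']}\n"
                f"C) {row['C']}\nD) {row['D']}\n"
                "Responda apenas com a letra da alternativa correta."
            )
            prediction = query_model(prompt).strip().upper()[:1]
            correct += prediction == row["answer"]
            total += 1
    return correct / total  # e.g. 0.775 corresponds to 77.5%

A call such as evaluate("revalida_questions.csv") would then return the fraction of the 399 questions answered correctly, which is the success rate reported in the Results below.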

Results

Among the smaller models, Llama 3 8B exhibited the highest success rate at 53.9%, while the medium-sized Mixtral 8x7B reached 63.7%. Among the larger models, Llama 3 70B achieved 77.5%. Of the proprietary models, GPT-4o and Claude Opus demonstrated the highest accuracy, scoring 86.8% and 83.8%, respectively.

Conclusions

Ten of the 31 LLMs performed above the human level on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models exhibited superior performance overall; however, some medium-sized LLMs outperformed certain larger ones.

Bruneti Severino, J. V., Basei de Paula, P. A., Berger, M. N., Loures, F. S., Todeschini, S. A., Roeder, E. A., Veiga, M. H., Guedes, M., Marques, G. L.
