The evaluation illusion of large language models in medicine

npj Digital Medicine

Monica Agrawal, Irene Y. Chen, Freya Gulamali & Shalmali Joshi

Discrepancies in data, tasks, automated metrics, and translational impact between the current status quo and real-world deployment lead to insufficient or misleading evaluations of large language models in medical contexts.

Summary

While large language models (LLMs) hold promise for transforming clinical healthcare, current comparisons and benchmark evaluations of LLMs in medicine often fail to capture real-world efficacy. Specifically, we highlight how key discrepancies arising from choices of data, tasks, and metrics can limit meaningful assessment of translational impact and lead to misleading conclusions. We therefore advocate for rigorous, context-aware evaluations and experimental transparency across both research and deployment.

Citation

Agrawal, Monica, et al. “The evaluation illusion of large language models in medicine.” npj Digital Medicine 8.1 (2025): 600.

BibTeX

@article{agrawal2025evaluation,
  title={The evaluation illusion of large language models in medicine},
  author={Agrawal, Monica and Chen, Irene Y and Gulamali, Freya and Joshi, Shalmali},
  journal={npj Digital Medicine},
  volume={8},
  number={1},
  pages={600},
  year={2025},
  publisher={Nature Publishing Group UK London}
}
