Another paper (Ramaswamy et al. 2026) questions whether generative large language models can reliably be used as tools for medical decision making, in alignment with the theoretical arguments made in my paper. I further suggest that this is not just a case-by-case phenomenon; it is a fundamental structural characteristic of generative LLMs, and the barriers to using a generative model for decision making are closely related to the problems that evidence-based medicine solves. For this reason, I urge medical researchers who are currently working on problems in evidence-based medicine (trials, prediction modeling, etc.) to continue their research. This is not to say that generative models are not useful in medicine. As I argue here, it is a matter of using them for the correct tasks.
References
Weisenthal, Samuel J. “Treatment, evidence, imitation, and chat.” arXiv preprint arXiv:2506.23040 (2025). https://arxiv.org/html/2506.23040v2
Ramaswamy, A., Tyagi, A., Hugo, H. et al. ChatGPT Health performance in a structured test of triage recommendations. Nat Med (2026). https://doi.org/10.1038/s41591-026-04297-7