Vol. 1 · Issue 3 · April 19, 2026 Free weekly · Evidence-based
Independent medical journal review
CliniScope Weekly

One peer-reviewed article. One honest take. Every week.


A Systematic Review of 30 Studies Found That Large Language Models Can Match or Outperform Clinicians on Structured Diagnostic Tasks

Shan G, Chen X, Wang C, et al.

Issue 2 · JMIR Medical Informatics · 2026-04-12 · 7 min read

"A systematic review of 30 studies found that large language models (LLMs) can match or outperform clinicians on structured diagnostic tasks. The question worth asking is what those tasks actually look like, and how much they resemble what happens in a real clinical encounter."

30 studies reviewed · 4,762 cases analyzed · 19 LLMs evaluated

Why this paper matters

Clinicians are already using LLMs informally, and health systems are building them into workflows. The evidence base for how well these tools perform in diagnostic contexts has accumulated quickly but unevenly, with individual studies ranging from strongly positive to cautionary depending on the specialty, model, and how performance was measured. This systematic review and meta-analysis attempts to synthesize what is actually known across multiple specialties and make it actionable.

What they did

Researchers searched PubMed, Embase, Web of Science, Scopus, and other databases through mid-2024 for studies evaluating LLM performance on clinical diagnostic tasks. Studies were included if they compared LLM diagnostic accuracy against a defined reference standard such as physician diagnosis, pathology confirmation, or structured case resolution. The final analysis included 30 studies published between 2023 and 2025, concentrated in Japan, the United States, and China. A total of 4,762 cases were analyzed across 19 different LLMs. Studies were included regardless of whether they used multiple-choice or open-ended diagnostic formats, and where available, LLM performance was compared directly against healthcare professionals on the same tasks.

What they found

LLMs demonstrated diagnostic accuracy that varied considerably across specialties, task types, and how diagnostic questions were posed. In studies that included a direct comparison between LLMs and healthcare professionals, LLMs showed competitive performance across multiple specialties, though the margin varied substantially by model and clinical context. Newer, more capable models consistently outperformed earlier versions. Performance was more reliable on structured, clearly defined diagnostic scenarios than on ambiguous presentations or cases involving multiple systems.

What the numbers actually mean

The variation across studies is the most important finding, and the easiest to overlook in a summary. The same model can perform at very different levels depending on whether the task is a multiple-choice clinical vignette or an open-ended diagnostic reasoning exercise under time pressure. Most studies in this review used structured formats. These formats do not capture the ambiguity, incomplete information, patient affect, or time constraints that define actual clinical encounters. That gap is not a minor methodological point — it is the difference between laboratory performance and clinical utility. An LLM that scores well on curated vignettes may still fail in ways that are hard to predict when confronted with a patient whose presentation does not organize itself around a clean diagnosis.

Limitations worth knowing

  • The majority of included studies used structured case vignettes rather than data from real clinical encounters, which limits generalizability considerably.
  • Publication bias is likely. Studies showing poor LLM performance are less likely to be published, which inflates the apparent accuracy across the literature.
  • Prompt design varied substantially across studies, making direct cross-study comparison unreliable and raising questions about whether differences in performance reflect true model capability or differences in how questions were framed.
  • None of the included studies measured patient outcomes. Diagnostic accuracy as a process measure is not the same as clinical impact, and the two can diverge in ways that matter.

The bottom line

LLMs perform competitively with clinicians on structured diagnostic tasks in controlled research settings. The conditions that produce those results are sufficiently different from clinical practice that the gap between reported accuracy and real-world utility remains wide. The studies needed to close that gap — prospective trials measuring what actually happens to patients — have not been done. Given how quickly LLMs are being adopted in clinical settings, the priority now is research on their ability to enhance physician diagnosis, not replace it. Tools like OpenEvidence, already being integrated into large hospital health systems, suggest this area of research will only grow. CliniScope Weekly will be watching closely.

Paper reviewed

Shan G, Chen X, Wang C, et al. "Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis." JMIR Medical Informatics. 2025;13:e64963. Published 2025 Apr 25. doi:10.2196/64963. Available free full text at: https://pmc.ncbi.nlm.nih.gov/articles/PMC12047852/

Read the original paper →