Benchmarks in medicine: the promise and pitfalls of evaluating AI tools with mismatched yardsticks


In May 2024, OpenAI released HealthBench, a new benchmarking system to test the clinical capabilities of large language models (LLMs) such as ChatGPT. On the surface, this may sound like yet another technical update. But for the medical world, it marked an important moment—a quiet acknowledgement that our current ways of evaluating medical AI are fundamentally wrong.

Headlines in the recent past have trumpeted that AI “outperforms doctors” or “aces medical exams.” The impression that’s coming through is these models are smarter, faster, and perhaps even safer. But this hype masks a deeper truth. To put it plainly, the benchmarks used to arrive at these claims are based on exams built for evaluating human memory retention from classroom teachings. They reward fact recall, not clinical judgment.

A calculator problem

A calculator can multiply two six-digit numbers within seconds. Impressive, no doubt. But does this mean calculators are better than, and understand maths more than mathematics experts ? Or better even than an ordinary person who takes a few minutes to do the calculation with a pen and paper?

Language models are celebrated because they can churn out textbook-style answers to MCQs and fill in the blanks for medical facts and questions faster than medical professors. But the practice of medicine is not a quiz. Real doctors deal with ambiguity, emotion, and decision-making under uncertainty. They listen, observe, and adapt.

The irony is that while AI beats doctors in answering questions, it still struggles to generate the very case vignettes that form the basis of those questions. Writing a good clinical scenario from real patients in clinical practice requires understanding human suffering, filtering irrelevant details, and framing the diagnostic dilemma with context. So far, that remains a deeply human ability.

Also Read: Why AI in healthcare needs stringent safety protocols

What existing benchmarks miss

Most widely-used benchmarks—MedQA, PubMedQA, MultiMedQA—pose structured questions with one “correct” answer or have fill in the blanks questions. They evaluate factual accuracy but overlook human nuance. A patient doesn’t say, “I have been using a faulty chair and sitting in the wrong posture for long hours and have a non-specific backache ever since I bought it. So please choose the best diagnosis and give appropriate treatment.” They just say, “Doctor, I’m tired. I don’t feel like myself.” That is where the real work begins.

Clinical environments are messy. Doctors deal with overlapping illnesses, vague symptoms, incomplete notes, and patients who may be unable—or unwilling—to tell the full story. Communication gaps, emotional distress, and even socio-cultural factors influence how care unfolds. And yet, our evaluation metrics continue to look for precision, clarity, and correctness—things that the real world rarely provides.

Benchmarking vs reality

It can be easy to decide who the best batter in the world is, by only counting runs scored. Similarly, bowlers can be ranked by the number of wickets taken. But answering the question “Who is the best fielder?” might not be as simple. Measuring fielding is very subjective and evades simple numbers. The number of runs outs assisted or catches taken only tells part of the story. The efforts made at the boundary line to reduce runs or mere intimidation through the presence of the fielders (like Jonty Rhodes or R. Jadeja) preventing runs at covers or points can’t be measured easily.

Healthcare is like fielding: it is qualitative, often invisible, deeply contextual, and hard to quantify. Any benchmark that pretends otherwise will mislead more than it illuminates.

This is not a new problem. In 1946, the civil servant Sir Joseph Bhore, when consulted to reform India’s healthcare said, “If it were possible to evaluate the loss, which this country annually suffers through the avoidable waste of valuable human material and the lowering of human efficiency through malnutrition and preventable morbidity, we feel that the result would be so startling that the whole country would be aroused and would not rest until a radical change had been brought about”. This quote reflects a longstanding dilemma—how to measure what truly matters in health systems. Even after 80 years, we have not found perfect evaluation metrics.

What HealthBench does

HealthBench at least acknowledges this disconnect. Developed by OpenAI in collaboration with clinicians, it moves away from traditional multiple-choice formats. It is also the first benchmark to explicitly score responses using 48,562 unique rubric criteriaranging from minus 10 to plus 10, reflecting some aspects of real-world stakes of clinical decision-making. A dangerously wrong answer must be punished more harshly than a mildly useful one. This, finally, mirrors medicine’s moral landscape.

Even so, HealthBench has limitations. It evaluates performance across just 5,000 “simulated” clinical cases, of which only 1,000 are classified as “difficult.” That is a vanishingly small slice of clinical complexity. Though commendably global, its doctor-rater pool includes just 262 physicians from 60 countries in 52 languages, with varying professional experience and cultural backgrounds (three Physicians from India participated, and simulations from 11 Indian languages were generated). HealthBench Hard, a challenging subset of 1,000 cases, revealed that many existing models scored zero—highlighting their inability to handle complex clinical reasoning. Moreover, these cases are still simulations. Thus, the benchmark is an improvement, not a revolution.

Also Read: Artificial Intelligence in healthcare: what lies ahead

Predictive AI’s collapse in the real world

This is not just about LLMs. Predictive models have faced similar failures. The sepsis prediction tool, developed by EPIC to flag early signs of sepsis, showed initial promise a few years ago. However, once deployed, it could not meaningfully improve outcomes. Another company that claimed to have developed a detection algorithm for liver transplantation recipients folded quietly after its model showed bias against young patients in Britain. It failed in the real world despite glowing performances on benchmark datasets. Why? Because predicting rare/critical events requires context-aware decision-making. A seemingly unknown determinant may lead to wrong predictions and unnecessary ICU admissions. The cost of error is high—and humans often bear it.

What makes a good benchmark?

A robust medical benchmark should meet four criteria:

Represent reality: Include incomplete records, contradictory symptoms, and noisy environments.

Test communication: Measure how well a model explains its reasoning, not just what answer it gives.

Handle edge cases: Evaluate performance on rare, ethically complex, or emotionally charged scenarios.

Reward safety over certainty: Penalise overconfident wrong answers more than humble uncertainty.

Currently, most benchmarks miss these criteria. And without these elements, we risk trusting technically smart but clinically naïve models.

Red teaming the models

One way forward is red teaming—a method borrowed from cybersecurity, where systems are tested against ambiguous, edge-case, or morally complex scenarios. For example: a patient in mental distress whose symptoms may be somatic; an undocumented illegal immigrant fearful of disclosing travel history; a child with vague neurological symptoms and an anxious parent pushing for a CT scan; a pregnant woman with religious objections to blood transfusion; a terminal cancer patient is unsure whether to pursue aggressive treatment or palliative care; a patient feigning for personal gain.

In these edge cases, models must go beyond knowledge. They must display judgment—or, at the very least, know when they don’t know. Red teaming does not replace benchmarks. But it adds a deeper layer, exposing overconfidence, unsafe logic, or lack of cultural sensitivity. These flaws matter more than ticking the right answer box in real-world medicine. Red teaming forces models to reveal what they know and how they think. It uncovers these aspects, which may be hidden in benchmark scores.

Why this matters

The core tension is this: medicine is not just about getting answers right. It is about getting people right. Doctors are trained to deal with doubts, handle exceptions, and recognise cultural patterns not taught in books (doctors also miss a lot). AI, by contrast, is only as good as the data it has seen and the questions it has been trained on. HealthBench, for all its flaws, is a small but vital course correction. It recognises that evaluation needs to change. It introduces a better scoring rubric. It asks harder questions. That makes it better. But we must remain cautious. Healthcare is not like image recognition or language translation. A single incorrect model output can mean a lost life and a ripple effect—misdiagnoses, lawsuits, data breaches, and even health crises. In the age of data poisoning and model hallucination, the stakes are existential.

The road ahead

We must stop asking if AI is better than doctors. That is not the right question. Instead, we should ask: Where is AI safe, useful, and ethical to deploy—and where is it not? Benchmarks, if thoughtfully redesigned, can help answer that. AI in healthcare is not a competition to win. It is a responsibility to share. We must stop treating model performance as a leaderboard sport and start thinking of it as a safety checklist. Until then, AI can assist. It can summarise. It can remind. However, it cannot replace clinical judgment’s moral and emotional weight. It certainly cannot sit beside a dying patient and know when to speak and when to stay silent.

(Dr. C. Aravinda is an academic and public health physician. The views expressed are personal. [email protected])



Source link

Leave a Comment

Scroll to Top
Receive the latest news

Subscribe To Our Weekly Newsletter

Get notified about new articles