A team of researchers at Babylon Health, the well-funded UK-based startup that facilitates telemedical consultations between patients and health experts, claims to have developed an AI system capable of matching expert clinicians’ triage decisions in 85% of cases. If it holds up to scrutiny, the system could ease the burden on the overloaded U.S. health care system, which is anticipated to face a shortfall of between 21,000 and 55,000 primary care doctors by 2023.
Triaging in this context refers to the process of uncovering enough medical evidence to determine the appropriate point of care given a patient’s presentation. Clinicians plan a sequence of questions in order to make a fast and accurate decision, reasoning about the likely causes of a condition and updating their plan after each new piece of information.
The Babylon Health team sought an automated approach built upon reinforcement learning, an AI training paradigm that spurs software agents to complete tasks via a system of rewards. They combined this with judgments from medical experts made over a data set of patient presentations, which encapsulated roughly 597 elements of observable symptoms or risk factors.
The researchers’ AI agent, a Deep Q network, learned an optimized policy based on 1,374 expert-crafted clinical vignettes. Each vignette was associated with an average of 3.36 expert triage decisions made by separate clinicians, and the validity of each vignette was reviewed independently by two clinicians.
At each step, the agent either asks for more information or makes one of four triage decisions. At the start of each new episode, the training environment is configured with a new clinical vignette. The environment then processes the evidence and triage decisions associated with that vignette and returns a value; if the agent picks a triage action, it receives a final reward.
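To make the episode loop concrete, here is a minimal Python sketch of what such an environment might look like. It is not Babylon’s implementation: the class and field names, the numeric labels for the four triage decisions, and the reward rule (the fraction of experts who chose the same decision) are all assumptions made for illustration. Only the overall shape (ask or triage, one vignette per episode, a final reward on triage, at most 23 questions) comes from the description above.

```python
import random
from dataclasses import dataclass

ASK = 0                         # action 0: ask for more information
TRIAGE_ACTIONS = (1, 2, 3, 4)   # four triage decisions (placeholder labels)
MAX_QUESTIONS = 23              # upper bound on questions per vignette


@dataclass
class Vignette:
    """Hypothetical container for one clinical vignette."""
    evidence: list          # items of evidence, revealed one per question
    expert_decisions: list  # triage decisions from separate clinicians


class TriageEnv:
    """Toy environment: one vignette per episode, final reward on triage."""

    def __init__(self, vignettes, seed=0):
        self.vignettes = vignettes
        self.rng = random.Random(seed)
        self.vignette = None
        self.revealed = []

    def reset(self):
        # Each new episode is configured with a new vignette.
        self.vignette = self.rng.choice(self.vignettes)
        self.revealed = []
        return tuple(self.revealed)

    def step(self, action):
        """Return (observation, reward, done) for the chosen action."""
        if action == ASK:
            limit = min(MAX_QUESTIONS, len(self.vignette.evidence))
            if len(self.revealed) < limit:
                next_item = self.vignette.evidence[len(self.revealed)]
                self.revealed.append(next_item)
            # Asking yields no reward and does not end the episode.
            return tuple(self.revealed), 0.0, False
        if action not in TRIAGE_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        # Assumed reward: share of experts who made the same decision.
        experts = self.vignette.expert_decisions
        reward = experts.count(action) / len(experts)
        return tuple(self.revealed), reward, True
```

An agent would call `reset()` once per episode and then `step()` repeatedly until `done` is true. A learned policy decides at each step whether the evidence revealed so far justifies committing to a triage decision.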
To validate the system, the researchers evaluated the model on a test set of 126 previously unseen vignettes using three target metrics: appropriateness, safety, and the average number of questions asked (between 0 and 23). During training on 1,248 vignettes, those metrics were evaluated over a sliding window of 20 vignettes, and during testing, they were evaluated over the whole test set.
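The evaluation scheme can be sketched in a few lines of Python. The article does not define appropriateness or safety, so the definitions below are assumptions: a decision counts as appropriate if it falls within the range of the experts’ decisions, and as safe if it is at least as urgent as the least urgent expert decision, with larger integers standing for more urgent care. The function and class names are hypothetical.

```python
from collections import deque


def episode_metrics(decision, expert_decisions, questions_asked):
    """Score one episode under the assumed metric definitions."""
    least_urgent = min(expert_decisions)
    most_urgent = max(expert_decisions)
    return {
        "appropriateness": float(least_urgent <= decision <= most_urgent),
        "safety": float(decision >= least_urgent),
        "questions": float(questions_asked),
    }


class SlidingMetrics:
    """Average per-episode metrics over the most recent `window` episodes."""

    def __init__(self, window=20):
        # A bounded deque drops the oldest episode automatically.
        self.episodes = deque(maxlen=window)

    def add(self, metrics):
        self.episodes.append(metrics)

    def mean(self):
        if not self.episodes:
            raise ValueError("no episodes recorded yet")
        count = len(self.episodes)
        keys = self.episodes[0].keys()
        return {k: sum(m[k] for m in self.episodes) / count for k in keys}
```

During training, a tracker like `SlidingMetrics(window=20)` would report running averages over the last 20 vignettes. For testing, the same per-episode scores would simply be averaged over all 126 held-out vignettes.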
The team reports that the best-performing model achieved an appropriateness score of 0.85 and a safety score of 0.93, and it asked an average of 13.34 questions (0.875). That’s on par with the human baseline (0.84 appropriateness, 0.93 safety, and all 23 questions).
“By learning when best to stop asking questions given a patient presentation, the [system] is able to produce an optimized policy which reaches the same performance as supervised methods while requiring less evidence. It improves upon clinician policies by combining information from several experts for each of the clinical presentations,” wrote the paper’s coauthors, who point out that the agent isn’t trained to ask specific questions and can be used in conjunction with any question-answering system. “This … approach can produce triage policies tailored to health care settings with specific triage needs.”
It’s worth noting that Babylon Health, which is backed by the UK’s National Health Service (NHS), has flirted with controversy. Nearly three years ago, it tried and failed to gain a legal injunction to block publication of a report from the NHS care standards watchdog. In February, it publicly attacked a UK doctor who raised around 100 test results he considered concerning. And it recently received a reprimand from UK regulators for promoting misleading advertising.
The thoroughness of its studies has also been called into question.
The Royal College of General Practitioners, the British Medical Association, Fraser and Wong, and the Royal College of Physicians issued statements questioning claims in a 2018 paper published by Babylon researchers, which asserted that its AI could diagnose common diseases as well as human physicians. “[There is no evidence it] can perform better than doctors in any realistic situation, and there is a possibility that it might perform significantly worse,” wrote the coauthors of a 2018 paper published in The Lancet. “Symptom checkers bring additional challenges because of heterogeneity in their context of use and experience of patients.”
In response to the criticism, Babylon said that “[s]ome media outlets may have misinterpreted what was claimed” but that it “[stood] by [its] original science and results.” It described the 2018 test as a “preliminary piece of work” that pitted the company’s AI against a “small sample of doctors,” and it referred to the study’s conclusion: “Further studies using larger, real-world cohorts will be required to demonstrate the relative performance of these systems to human doctors.”
In this latest paper, Babylon disclosed that the chief investigator and most coinvestigators were paid employees.