Say you’re a doctor. You’d like guidance on how to treat a particular patient, and you have the opportunity to query a group of physicians about what they’d do next.
Who do you include in that group? All doctors who have seen similar patients? A smaller number of doctors who are considered the best?
These questions are at the root of recent Stanford research.
The study, published in the Journal of Biomedical Informatics, examines a key aspect of medical artificial intelligence: If machines are to provide advice for patient care, who should those machines be learning from?
Jonathan Chen, MD, PhD, tackled this question as the latest step in his quest to build OrderRex, a tool that will mine data from electronic health records to inform medical decisions. In a prototype, an algorithm that works like Amazon’s recommendation feature showed doctors how their peers managed similar patient cases, telling users that other doctors in this situation ordered this medication or this test.
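The study doesn’t publish the recommender itself, but the “other doctors also ordered” idea can be sketched as simple item co-occurrence counting over past encounters. Everything below is illustrative: the order names and encounter data are made up, and a real system would use association statistics rather than raw counts.

```python
from collections import Counter
from itertools import permutations

def build_cooccurrence(encounters):
    """Count how often each ordered pair of clinical orders appears in the same encounter."""
    co = Counter()
    for orders in encounters:
        for a, b in permutations(set(orders), 2):
            co[(a, b)] += 1
    return co

def recommend(co, current_order, top_n=3):
    """Suggest the orders most often placed alongside `current_order`."""
    counts = Counter({b: n for (a, b), n in co.items() if a == current_order})
    return [item for item, _ in counts.most_common(top_n)]

# Hypothetical encounters: each list is the set of orders placed for one patient.
encounters = [
    ["chest_xray", "blood_culture", "ceftriaxone"],
    ["chest_xray", "ceftriaxone", "azithromycin"],
    ["chest_xray", "blood_culture", "ceftriaxone", "azithromycin"],
    ["cbc", "metabolic_panel"],
]

co = build_cooccurrence(encounters)
print(recommend(co, "chest_xray"))  # first suggestion: 'ceftriaxone'
```

Given an order a doctor has just placed, the sketch surfaces whatever most frequently accompanied that order in past cases, which is the behavior the prototype’s interface exposes to users.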
Now, Chen and his colleagues are working through the logistics of how the computer decides what to recommend. Research assistant Jason K. Wang is the study’s lead author. Chen, the senior author, told me,
The distinguishing question was, if you had to learn medical practices, should you just try to get everybody’s data and learn from everybody? Or should you try to find the ‘good’ doctors? Then that evokes some really deep follow-up questions, such as, what is a preferred doctor or a good expert? What does that mean? How would we define that?
He asked multiple doctors a simplified version of these questions — if you had to grade each doctor by a number, how would you come by that number? — and tumbled into the usual thicket surrounding physician evaluations. Which outcomes should be considered? What if you see sicker patients? What about hospital readmission, length of stay, and patient satisfaction?
Ultimately, Chen and his colleagues settled on 30-day patient mortality: the actual number of deaths compared with the number expected, as calculated by a computer from three years of electronic health record data.
“We tried a couple of things, but found they were correlated enough that it basically didn’t matter. The doctors who rose to the top of the list versus the bottom — it was a pretty stable list,” Chen said. “Plus, you can’t really game mortality: it’s the most patient-centered outcome there is.”
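A minimal sketch of that observed-to-expected scoring, with made-up per-physician numbers. The risk model that produces the expected deaths is assumed here, not shown; a ratio below 1.0 means fewer deaths than the model predicted for that physician’s patient mix.

```python
def oe_ratio(deaths_observed, deaths_expected):
    """Observed-to-expected 30-day mortality ratio; < 1.0 means fewer deaths than expected."""
    return deaths_observed / deaths_expected

# Hypothetical per-physician figures: (observed deaths, expected deaths).
# Expected deaths would come from a risk model fit on three years of record data.
physicians = {
    "dr_a": (4, 8.0),   # half the expected deaths
    "dr_b": (10, 9.5),  # slightly more than expected
}

scores = {name: oe_ratio(o, e) for name, (o, e) in physicians.items()}
experts = [name for name, r in sorted(scores.items(), key=lambda kv: kv[1]) if r < 1.0]
print(experts)  # ['dr_a']
```

Ranking by this ratio rather than by raw death counts is what lets physicians who see sicker patients still score well, which is the adjustment the measure is designed to make.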
With “expert” doctors identified as those with low patient mortality rates, the researchers turned to defining “optimal” care to compare with recommendations that the algorithm would produce. They decided to try two different options. One was care that followed standards from clinical practice guidelines based on literature available through such venues as the National Guideline Clearinghouse. The second was care reflecting patterns that typically led to better-than-expected patient outcomes — calculated by the computer from data in the electronic patient records.
With this established, they tested the algorithm to see how its care recommendations for a condition such as pneumonia compared with both standard clinical practice guidelines and the computer-derived above-average patient care. First, they fed the algorithm information from patients seen by expert doctors. Then, separately, the algorithm looked at patients seen by all doctors.
The result? Not much difference between recommendations derived from care in the expert group versus all of the doctors.
Essentially, Chen said, outlier behavior from specific doctors was neutralized. And without significant differences in the results, the findings argue for building machine-learning models around data from more physicians rather than a curated bunch. The logic is similar to calculating an average: with more numbers in the equation, each number carries less weight and the average is less influenced by individual numbers — including anomalies.
Chen summed it up:
As a human, it’s not feasible to interview thousands of people to help learn medical practice, so I’m going to seek out a few key mentors and experts. What large datasets and computational algorithms enable us to do is to essentially learn from ‘everybody,’ rather than being constrained to learn from a small number of ‘experts.’
With this question resolved, Chen and his colleagues have now built a prototype user interface incorporating the suggestion system, and they plan to bring in doctors to test it with simulated patient cases.
“The next big challenge to closing the loop on a learning health care system,” he said, “is… where we don’t just learn interesting things from clinical data, but we design, study, and evaluate how to deliver that information back to clinicians and patients.”