Incident 225: IBM Watson for Oncology Criticized by Customers for Allegedly Unsafe and Inaccurate Cancer Treatment Recommendations
Internal documents from IBM Watson Health (NYSE:IBM) indicate that the company’s Watson for Oncology product often returns “multiple examples of unsafe and incorrect treatment recommendations,” according to a new report from STAT News.
The documents come from slides presented last year by IBM Watson Health’s deputy chief health officer, according to the report, and include feedback from customers that indicated the product is “often inaccurate” and that its recommendations bring to light “serious questions about the process for building content and the underlying technology.”
The issues were blamed on the training the Watson product received from IBM engineers and physicians at the Memorial Sloan Kettering Cancer Center, which relied on "synthetic," or hypothetical, patients and cases instead of real patient data, STAT reports.
IBM has not publicly acknowledged the issues, according to the report, and has communicated to its customers that all data included in Watson for Oncology is based on real patients and that the product has won praise around the world for its recommendations.
Earlier this year, John Kelly, senior VP of IBM's cognitive solutions division, touted at an IBM event that Watson "has ingested all of the Memorial Sloan data, historic patients and results," and also commented that the company's Watson product was "going fabulously," STAT reports.
“It’s critical that everyone involved know what the data was that it was trained on,” Kelly said, according to the report.
But the internal documents indicate that the problems, which were known to executives at the company, were serious and systemic, STAT reports. They included the product making recommendations that conflicted with national treatment guidelines, though no adverse events related to the recommendations were reported.
The documents also indicated that internal studies of the Watson for Oncology product were designed to generate favorable results, according to the report.
Customer feedback captured in the internal documents includes comments reflecting serious dissatisfaction from customers, according to the report.
“This product is a piece of s***. We bought it for marketing and with hopes that you would achieve the vision. We can’t use it for most cases,” a doctor at Florida’s Jupiter Hospital was quoted as saying in the documents, according to STAT.
IBM has defended its Watson for Oncology software, releasing a statement to STAT indicating that it has “learned and improved Watson Health based on continuous feedback from clients, new scientific evidence and new cancers and treatment alternatives,” and that it has released 11 software updates to improve functionality over the past year.
But internal documents indicate that the training and effectiveness of the Watson for Oncology system were flawed due to the small number of cases used, the inclusion of artificial cases, and the fact that only one or two doctors supplied recommendations for each type of cancer it was designed to work with, according to the report.
The internal presentation included an example case in which a 65-year-old man with lung cancer and evidence of severe bleeding was recommended chemotherapy and a drug called bevacizumab, which carries a "black box" warning advising that it shouldn't be administered to patients experiencing severe bleeding, according to the report.
"Oy vey. Any time you have an algorithm that makes a recommendation that's dangerous, that's extremely worrisome. I mean, the whole idea is that algorithms are supposed to improve safety and quality," Scripps Translational Science Institute director Dr. Eric Topol told STAT after being informed of the example error.
The report also raises questions about whether IBM Watson is being transparent about the source of its training, and whether its internal studies correctly reflect its abilities.
"The thing which is a bit misleading is that everybody's led to believe that this is the consensus of the entire brain trust of Sloan Kettering. But in fact it's the consensus of … a small subset of the entire brain trust. They should be called out on this. I would bet this is a calculated risk they took. … They're kind of messing with people, but it's within the marketing spin that is increasingly allowed these days, let me put it that way. But not everybody can spot it, so it's not honest," Stanford associate professor of medicine and biomedical data science Nigam Shah told STAT, according to the report.
Last month, reports emerged that IBM Watson Health was reportedly cutting back on the portion of its business that sells to hospitals due to a softening market for value-based healthcare offerings.
Internal IBM documents show that its Watson supercomputer often spit out erroneous cancer treatment advice and that company medical specialists and customers identified “multiple examples of unsafe and incorrect treatment recommendations” as IBM was promoting the product to hospitals and physicians around the world.
The documents — slide decks presented last summer by IBM Watson Health’s deputy chief health officer — largely blame the problems on the training of Watson by IBM engineers and doctors at the renowned Memorial Sloan Kettering Cancer Center. The software was drilled with a small number of “synthetic” cancer cases, or hypothetical patients, rather than real patient data. Recommendations were based on the expertise of a few specialists for each cancer type, the documents say, instead of “guidelines or evidence.”
STAT has seen portions of the two presentations, from June and July of 2017. At the time, they were shared widely with the management of IBM’s Watson Health division. The documents contain scathing assessments of the Watson for Oncology product by customers and conclude that the “often inaccurate” recommendations raise “serious questions about the process for building content and the underlying technology.”
IBM has not publicly acknowledged the shortcomings of the software, which uses artificial intelligence algorithms to recommend treatments for individual patients. To the contrary, top company executives told customers and others that Watson for Oncology’s advice for physicians is based on data from real patients, and that it had won nearly universal praise from doctors around the world.
“Physicians like it. Physicians have said to me, if I took it away now, I’d have a revolt,” Deborah DiSanzo, general manager of IBM Watson Health, told STAT in a June 2017 interview.
As recently as two months ago, John Kelly, the senior vice president for IBM’s cognitive solutions division, which includes Watson, said at an IBM event that Watson “has ingested all of the Memorial Sloan data, historic patients and results.” And at another event in April, he said Watson for Oncology is “going fabulously.”
“It’s critical that everyone involved know what the data was that it was trained on,” Kelly added, echoing September 2017 comments from IBM CEO Ginni Rometty, who said, “You must be transparent about it because that matters in these decisions.”
STAT first published an investigation about problems with Watson for Oncology last September, reporting that it was not living up to the company’s expectations, and that it was generating complaints from doctors around the world that its recommendations often weren’t appropriate for patients in their countries. As a result, IBM sullied its reputation in the burgeoning global market for tools using artificial intelligence to improve cancer care, a sector that is a top priority for IBM and potentially worth billions of dollars.
But these new documents reveal that the problems were more serious and systemic and that IBM executives knew that the product was generating inaccurate recommendations that were at odds with national treatment guidelines — although there’s no mention that patients were actually harmed. The documents also state that studies IBM conducted on the software, whose findings were touted as evidence of the system’s usefulness, were designed to generate favorable results.
The documents were presented by Dr. Andrew Norden, an oncologist and deputy health chief, before he left IBM Watson Health last August. They show that, while IBM executives were talking up the product publicly, they were hearing harsh comments from customers. Even doctors at hospitals helping to promote the product were telling IBM executives privately that it was not useful in treating patients.
“This product is a piece of s—,” one doctor at Jupiter Hospital in Florida told IBM executives, according to the documents. “We bought it for marketing and with hopes that you would achieve the vision. We can’t use it for most cases.”
IBM defended its Watson for Oncology software, saying in a statement to STAT: “We have learned and improved Watson Health based on continuous feedback from clients, new scientific evidence and new cancers and treatment alternatives. This includes 11 software releases for even better functionality during the past year, including national guidelines for cancers ranging from colon to liver cancer.”
It said Watson for Oncology is trained to help treat 13 cancers, and is used by 230 hospitals worldwide.
Caitlin Hool, a spokeswoman for Memorial Sloan Kettering, said in a statement that the internal documents critical of the system’s training and performance reflect “the robust nature of the process” of building and deploying the product in clinical care. Hool said the cancer center is continuously working with IBM to improve the accuracy and breadth of the system’s recommendations.
“Patient safety is paramount,” Hool said. “While Watson for Oncology provides safe treatment options, treatment decisions ultimately require the involvement and clinical judgement of the treating physician. ... No technology can replace a doctor and his or her knowledge about their individual patient. To that point, the tool is also not equivalent to the cancer care delivered at MSK.”
Norden declined to comment, emailing STAT, “As you are aware, I am no longer employed by IBM Watson Health, and as such I am unable to discuss IBM or its business.”
IBM struck a deal with Memorial Sloan Kettering to train Watson to help treat cancer patients in 2012, and began selling the product in Asia a few years later even though it had only been trained on a handful of cancers.
The presentations by Norden call out an array of purported flaws in the training methods, including the small number of cases used, which they say was "determined without statistical input," and differences between MSK's treatments and standard guidelines. They also note that Watson for Oncology was slow to adapt to new research findings and changing treatment guidelines.
The July 27 document stated the training and effectiveness of the product was undermined by the “inadequacy of the training cases.” It said one or two doctors trained the system to give treatment recommendations for each type of cancer, and that the cases were “synthetic,” meaning that they were devised by MSK doctors and were not real patients.
The synthetic cases were compiled by the doctors and IBM engineers to expose Watson for Oncology to clinical scenarios, as opposed to the actual records of patients who were treated at the hospital. That meant that Watson’s recommendations were driven by the doctors’ own treatment preferences — not a machine learning analysis of real patient cases.
The training methods, and Watson’s advice, led to loud complaints from doctors, according to the presentation, contributing to client dissatisfaction and physician concern. It also stated that Memorial Sloan Kettering’s recommendations deviated from guidelines published by the National Comprehensive Cancer Network, a frequently referenced source of treatment recommendations, and in some instances reflected an “unconventional interpretation of evidence.”
The presentation cited as an example Watson’s recommendation that a 65-year-old man with newly diagnosed lung cancer and evidence of severe bleeding be given combination chemotherapy and a drug called bevacizumab. A “black box” warning for the drug, sold under the brand name Avastin, cautions it can lead to “severe or fatal hemorrhage” and shouldn’t be administered to patients experiencing serious bleeding.
“Oy vey,” Dr. Eric Topol, director of the Scripps Translational Science Institute, told STAT upon hearing about the error. “Any time you have an algorithm that makes a recommendation that’s dangerous, that’s extremely worrisome. I mean, the whole idea is that algorithms are supposed to improve safety and quality.”
In its statement, Memorial Sloan Kettering said it believes the lung cancer recommendation cited in the presentation was part of IBM’s system testing, and was not given to a real patient. “This is an important distinction and underscores the importance of testing and the fact that the tool is intended to supplement – not replace — the clinical judgement of the treating physician,” the statement said.
The cancer center also said that in 2014, when Watson for Oncology was still in development, historical patient cases were initially used to train the system. But IBM determined that the synthetic cases — designed to be representative of cohorts of actual MSK patients — were better suited for the development of Watson for Oncology.
“The speed at which standards of care have changed require a more dynamic approach than historical data can provide because historical cases do not necessarily reflect the newest standards of care,” MSK’s statement said. “Synthetic cases also allow for diverse treatment options to be included in Watson for Oncology’s recommendations, rather than a more narrow focus of how individual patients were treated at MSK.”
But product information posted on the IBM website and dated Feb. 15, 2017, implies that Watson continues to be trained with real patient data. It states that Watson for Oncology “analyzes patient data against thousands of historical cases and insights gleaned from thousands of Memorial Sloan Kettering MD and analyst hours.”
The July 2017 presentation shows the number of cases used to train Watson as of that date for eight different cancers; they range from 635 cases for lung cancer to 106 for ovarian.
Experts in artificial intelligence told STAT that IBM’s portrayal of the training, and the number of doctors and patients involved, raises questions about whether it’s being transparent with users about the source and value of Watson’s recommendations.
“The thing which is a bit misleading is that everybody’s led to believe that this is the consensus of the entire brain trust of Sloan Kettering,” said Nigam Shah, associate professor of medicine and biomedical data science at Stanford. “But in fact it’s the consensus of ... a small subset of the entire brain trust.”
“They should be called out on this,” Shah added. “I would bet this is a calculated risk they took. ... They’re kind of messing with people, but it’s within the marketing spin that is increasingly allowed these days, let me put it that way. But not everybody can spot it, so it’s not honest.”
Some experts said it’s possible that hypothetical patient data would do a good job of training the system — if it was representative of real patient data.
"I would certainly want to see some validation to whether the synthetic data is representative of anything that would make sense," said Dr. Jonathan Chen, an assistant professor at Stanford's Center for Biomedical Informatics Research.
Jana Eggers, CEO of Nara Logics, an artificial intelligence company, said Watson’s use of synthetic data made clear that this software was not making use of the “big data” that exists in health care — troves of information about individuals that are too complex or burdensome for humans to navigate.
“They’re making up personas of cancer patients, basically,” Eggers said. “Why do you do that when you have the real people?”
The internal documents also raise questions about the validity of the studies conducted by IBM that the company used to demonstrate the value of Watson for Oncology to doctors around the world.
In recent years, IBM and its clinical partners have published multiple studies demonstrating that Watson would achieve a high level of “concordance” with the treatment recommendations of oncologists. That is theoretically valuable because it shows that Watson could generate recommendations within seconds that doctors would otherwise spend many hours or days developing.
But one of the presentations states that the concordance studies had been “designed in a way that makes negative findings unlikely.” It noted that sophisticated users of the system would demand “robust prospective evidence” of compliance with guidelines, cost savings, and improvement on quality metrics.
“The system as currently designed is unlikely to impact any of the above,” Norden’s June 26, 2017, presentation said.
It cautioned, however, about the risks of conducting such a study at a time when IBM was already selling the system around the world: “In my view, conducting a rigorous study now is a very high-risk endeavor that could be embarrassing at best and have serious adverse business consequences at worst.”
Indeed, a company working with IBM to implement Watson for Oncology at hospitals in the Netherlands told STAT in June that they aren’t interested in “concordance” or whether Watson recommends a treatment that is the same as treatment guidelines.
“What we are interested in is whether the system will challenge us in our thinking,” said Vincent The, head of strategy, research, and development for MRDM, the Dutch company.
He said MRDM is working with IBM to improve Watson for Oncology. “It seems to be maturing quite a bit,” The said. “However, at the current state, it’s not at the level we need it to be.”
Many users of Watson for Oncology, as well as IBM’s own employees, were raising concerns about the system’s performance last year. Norden’s presentation stated that a number of employees, including developers and oncologists, had “confided” in him that the product was “very limited.”
Norden’s presentation offered a couple of solutions for improving the system: the number of patient cases could be increased to reflect “MSK practice patterns,” with minimum numbers to be set based on statistician input. Or, he said, standard treatment guidelines could be used as the basis of Watson’s recommendations, customized for different locations, and Memorial Sloan Kettering could identify its own institutional approaches within the system.
It is not clear whether any of those suggestions was implemented.
Norden left IBM in August 2017 to take a job at Cota Inc., another company seeking to analyze cancer data that had already struck up a partnership with IBM.
Since June 2017, the two companies have been working on a joint product that would combine Cota’s technology with Watson. Cota can compare individual cancer patients to a private database of how other patients were treated to help a doctor determine the best method of treatment.
Norden’s presentation suggested that the work with Cota, a company founded by a doctor at Hackensack Meridian Health in New Jersey, could help expose Watson for Oncology to the real-world evidence needed to improve its recommendations.
Doctors at Hackensack Meridian have meanwhile completed a pilot project testing the joint product.
“It went well,” Norden said of the project in a July 10 interview with STAT, explaining that it was intended to create a tool physicians could use at the point of care to highlight the types of treatments that had worked best for specific types of patients.
“We’re taking the pilot and scaling it up,” Norden said, declining to elaborate.
Memorial Sloan Kettering had also noticed the promise of Cota’s product: In November 2017, it struck a deal to have Cota help analyze the medical center’s historical patient records. As part of the deal, MSK received equity in Cota.
In an emailed response to STAT questions about the reason for the partnership, Hool, the MSK spokeswoman, wrote: “There are important insights to be gained from the information in patients’ medical records, but these insights are often inaccessible due to unstructured, disjointed data within the medical record. Cota has an innovative platform and approach for taking patient records and turning them into data sets that can be analyzed for insights into diagnosis, care pathways, and outcomes.”