Incident 81: Researchers find evidence of racial, gender, and socioeconomic bias in chest X-ray classifiers

Description: A study by the University of Toronto, the Vector Institute, and MIT showed that the datasets used to train AI systems for classifying chest X-rays led those systems to exhibit gender, socioeconomic, and racial biases.

Suggested citation format

Hall, Patrick. (2020-10-21) Incident Number 81. In McGregor, S. (ed.) Artificial Intelligence Incident Database. Responsible AI Collaborative.

Incident Stats

Incident ID
81
Report Count
1
Incident Date
2020-10-21
Editors
Sean McGregor, Khoa Lam

CSET Taxonomy Classifications

Taxonomy Details

Full Description

A study by the University of Toronto, the Vector Institute, and MIT showed that the datasets used to train AI systems for classifying chest X-rays led those systems to exhibit gender, socioeconomic, and racial biases. Google and startups such as Qure.ai, Aidoc, and DarwinAI have built systems that scan chest X-rays to determine the likelihood of conditions like fractures and collapsed lungs. The databases used to train the AI were found to consist primarily of examples from white patients (67.64%), leading the diagnostic systems to be more accurate at diagnosing white patients than other patients. Black patients were half as likely to be recommended for further care when it was needed.

Short Description

A study by the University of Toronto, the Vector Institute, and MIT showed that the datasets used to train AI systems for classifying chest X-rays led those systems to exhibit gender, socioeconomic, and racial biases.

Severity

Unclear/unknown

Harm Distribution Basis

Race, Sex, Financial means

Harm Type

Harm to physical health/safety

AI System Description

Google and startup companies Qure.ai, Aidoc, and DarwinAI, which use AI systems to analyze medical imagery

System Developer

Google

Sector of Deployment

Human health and social work activities

Relevant AI functions

Perception, Cognition

AI Techniques

medical image processor

AI Applications

image classification

Named Entities

MIT, Mount Sinai Hospital, University of Toronto, Vector Institute, Google, Qure.ai, Aidoc, DarwinAI

Technology Purveyor

Google

Beginning Date

2020-10-21T07:00:00.000Z

Ending Date

2020-10-21T07:00:00.000Z

Near Miss

Unclear/unknown

Intent

Unclear

Lives Lost

No

Infrastructure Sectors

Healthcare and public health

Data Inputs

medical imagery databases

Incidents Reports

Google and startups like Qure.ai, Aidoc, and DarwinAI are developing AI and machine learning systems that classify chest X-rays to help identify conditions like fractures and collapsed lungs. Several hospitals, including Mount Sinai, have piloted computer vision algorithms that analyze scans from patients with the novel coronavirus. But research from the University of Toronto, the Vector Institute, and MIT reveals that chest X-ray datasets used to train diagnostic models exhibit imbalance, biasing them against certain gender, socioeconomic, and racial groups.

Partly due to a reticence to release code, datasets, and techniques, much of the data used today to train AI algorithms for diagnosing diseases may perpetuate inequalities. A team of U.K. scientists found that almost all eye disease datasets come from patients in North America, Europe, and China, meaning eye disease-diagnosing algorithms are less certain to work well for racial groups from underrepresented countries. In another study, Stanford University researchers claimed that most of the U.S. data for studies involving medical uses of AI come from California, New York, and Massachusetts. A study of a UnitedHealth Group algorithm determined that it could underestimate by half the number of Black patients in need of greater care. And a growing body of work suggests that skin cancer-detecting algorithms tend to be less precise when used on Black patients, in part because AI models are trained mostly on images of light-skinned patients.

The coauthors of this newest paper sought to determine whether state-of-the-art AI classifiers trained on public medical imaging datasets were fair across different patient subgroups. They specifically looked at MIMIC-CXR (which contains over 370,000 images), Stanford’s CheXpert (over 223,000 images), the U.S. National Institutes of Health’s ChestX-ray (over 112,000 images), and an aggregate of all three, whose scans from over 129,000 patients combined are labeled with the sex and age range of each patient. MIMIC-CXR also has race and insurance type data; excluding 100,000 images, the dataset specifies whether patients are Asian, Black, Hispanic, white, Native American, or other and whether they’re on Medicare, Medicaid, or private insurance.

After feeding the classifiers the datasets to demonstrate they reached near-state-of-the-art classification performance, which ruled out the possibility that any disparities simply reflected poor overall performance, the researchers calculated and identified disparities across the labels, datasets, and attributes. They found that all four datasets contained “meaningful” patterns of bias and imbalance, with female patients suffering the highest disparity despite the fact that the proportion of women was only slightly smaller than that of men. White patients — the majority, with 67.6% of all the X-ray images — were the most-favored subgroup, while Hispanic patients were the least favored. And bias existed against patients with Medicaid insurance, the minority population with only 8.98% of X-ray images. The classifiers often provided Medicaid patients with incorrect diagnoses.

The researchers note that their study has limitations arising from the nature of the labels in the datasets. Each label was extracted from radiology reports using natural language processing techniques, meaning a portion of them could have been erroneous. The coauthors also concede that the quality of the imaging devices themselves, the region of the data collection, and the patient demographics at each collection site might have confounded the results.
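The subgroup disparity analysis described above can be illustrated with a short sketch. The code below is not the authors' implementation; it simply shows one common way to quantify a gap of this kind, namely comparing true-positive rates (the fraction of genuinely positive cases the model flags) across values of a protected attribute such as sex, race, or insurance type. The column names `y_true`, `y_score`, `sex`, and `insurance`, and the 0.5 decision threshold, are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): per-subgroup true-positive
# rates for a single diagnostic label, given a table of model scores and
# a protected attribute. Column names and the threshold are assumptions.
import pandas as pd

def tpr_by_group(df: pd.DataFrame, group_col: str, threshold: float = 0.5) -> pd.Series:
    """Return the true-positive rate for each subgroup in `group_col`."""
    positives = df[df["y_true"] == 1]              # patients who actually have the finding
    flagged = positives["y_score"] >= threshold    # cases the model flags as positive
    return flagged.groupby(positives[group_col]).mean()  # fraction flagged = TPR per subgroup

def tpr_gap(df: pd.DataFrame, group_col: str) -> float:
    """Largest TPR difference between any two subgroups (a simple disparity measure)."""
    rates = tpr_by_group(df, group_col)
    return float(rates.max() - rates.min())

# Hypothetical usage with a predictions table:
# df = pd.DataFrame({"y_true": [...], "y_score": [...], "sex": [...], "insurance": [...]})
# print(tpr_by_group(df, "sex"))
# print("TPR gap across insurance types:", tpr_gap(df, "insurance"))
```

A TPR gap is only one of several possible disparity measures; false-positive or underdiagnosis rates could be compared in the same way.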

However, they assert that even the implication of bias is enough to warrant a closer look at the datasets and any models trained on them. “Subgroups with chronic underdiagnosis are those who experience more negative social determinants of health, specifically, women, minorities, and those of low socioeconomic status. Such patients may use healthcare services less than others,” the researchers wrote. “There are a number of reasons why datasets may induce disparities in algorithms, from imbalanced datasets to differences in statistical noise in each group to differences in access to healthcare for patients of different groups … Although ‘de-biasing’ techniques may reduce disparities, we should not ignore the important biases inherent in existent large public datasets.”

Beyond basic dataset challenges, classifiers lacking sufficient peer review can encounter unforeseen roadblocks when deployed in the real world. Scientists at Harvard found that algorithms trained to recognize and classify CT scans could become biased toward scan formats from certain CT machine manufacturers. Meanwhile, a Google-published whitepaper revealed challenges in implementing an eye disease-predicting system in hospitals in Thailand, including issues with scan accuracy. And studies conducted by companies like Babylon Health, a well-funded telemedicine startup that claims to be able to triage a range of diseases from text messages, have been repeatedly called into question.

The researchers of this study recommend that practitioners apply “rigorous” fairness analyses before deployment as one solution to bias. They also suggest that clear disclaimers about the dataset collection process and the potential resulting algorithmic bias could improve assessments for clinical use.
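One way a team might act on that recommendation is a simple pre-deployment gate that reports subgroup gaps and withholds sign-off when any exceed a chosen tolerance. This is a sketch of the idea, not a procedure from the paper: the `subgroup_tpr_gap` and `fairness_gate` helpers are hypothetical, and the column names, attribute list, and 0.05 tolerance are assumptions a team would set for itself.

```python
# Illustrative pre-deployment fairness check (an assumption, not the
# paper's protocol): compute the largest subgroup TPR gap for each
# protected attribute and flag any gap above a chosen tolerance.
from typing import Iterable
import pandas as pd

def subgroup_tpr_gap(df: pd.DataFrame, group_col: str, threshold: float = 0.5) -> float:
    """Largest difference in true-positive rate between any two subgroups."""
    positives = df[df["y_true"] == 1]
    rates = (positives["y_score"] >= threshold).groupby(positives[group_col]).mean()
    return float(rates.max() - rates.min())

def fairness_gate(df: pd.DataFrame, attributes: Iterable[str], tolerance: float = 0.05) -> bool:
    """Print per-attribute TPR gaps and return True only if every gap is within tolerance."""
    within = True
    for attr in attributes:
        gap = subgroup_tpr_gap(df, attr)
        status = "OK" if gap <= tolerance else "REVIEW"
        print(f"{attr:>12}: TPR gap = {gap:.3f}  [{status}]")
        within = within and gap <= tolerance
    return within

# Hypothetical usage:
# if not fairness_gate(predictions, ["sex", "race", "insurance"]):
#     print("Disparities exceed tolerance; hold deployment pending clinical review.")
```

A check like this does not remove bias; it only surfaces disparities early enough for the clinical review and dataset disclaimers the researchers call for.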

Researchers find evidence of racial, gender, and socioeconomic bias in chest X-ray classifiers
