Posted on February 12, 2023

Health Care Bias Is Dangerous. But So Are ‘Fairness’ Algorithms

Wired, February 8, 2023

Mental and physical health are crucial contributors to living happy and fulfilled lives. {snip} Artificial intelligence is one of the big hopes, and many companies are investing heavily in tech to serve growing health needs across the world. And many promising examples exist: AI can be used to detect cancer, triage patients, and make treatment recommendations. One goal is to use AI to increase access to high-quality health care, especially in places and for people who have historically been shut out.

Yet racially biased medical devices, for example, caused delayed treatment for darker-skinned patients during the Covid-19 pandemic because pulse oximeters overestimated blood oxygen levels in minorities. {snip} Patient triage systems regularly underestimate the need for care in minority ethnic patients. {snip}

Fortunately, many in the AI community are now actively working to redress these kinds of biases. Unfortunately, as our latest research shows, the algorithms they have developed could actually make things worse in practice and put people’s lives at risk.

The majority of algorithms developed to enforce “algorithmic fairness” were built without policy and societal contexts in mind. Most define fairness in simple terms: reducing gaps in performance or outcomes between demographic groups. Successfully enforcing fairness in AI has come to mean satisfying one of these abstract mathematical definitions while preserving as much of the accuracy of the original system as possible.
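
To make that concrete, here is a minimal sketch, purely our own illustration rather than code from any deployed system, of how such a definition is typically operationalized: compute a performance statistic such as recall for each demographic group, and declare the system “fair” once the gap between groups falls below some chosen tolerance (the function names and the 0.05 tolerance are assumptions for the example).

```python
import numpy as np

def recall(y_true, y_pred):
    """Share of truly high-risk (1) cases that the model also flags as high risk."""
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

def recall_gap(y_true, y_pred, group):
    """Largest difference in recall across demographic groups."""
    recalls = [recall(y_true[group == g], y_pred[group == g])
               for g in np.unique(group)]
    return max(recalls) - min(recalls)

def is_fair(y_true, y_pred, group, tolerance=0.05):
    """'Fair' under this narrow definition just means the gap is small enough,
    regardless of how that equality was achieved."""
    return recall_gap(y_true, y_pred, group) <= tolerance
```

Notice that the check says nothing about whether the gap was closed by helping the worse-off group or by degrading the better-off one; that silence is exactly the problem discussed below.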

{snip}

Imagine that, in the interest of fairness, we want to reduce bias in an AI system used for predicting future risk of lung cancer. Our imaginary system, similar to real-world examples, suffers from a performance gap between Black and white patients. Specifically, the system has lower recall for Black patients, meaning it routinely underestimates their risk of cancer and incorrectly classifies patients who are actually at “high risk” of developing lung cancer in the future as “low risk.”
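
The toy simulation below, with entirely invented numbers, shows what such a gap looks like in code: a classifier that recovers most truly high-risk white patients but misses a much larger share of truly high-risk Black patients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy data: 1 = "high risk", 0 = "low risk".
n = 10_000
group = rng.choice(["Black", "white"], size=n)
y_true = rng.binomial(1, 0.2, size=n)

# Simulate a biased classifier that misses high-risk Black patients more often.
miss_rate = np.where(group == "Black", 0.4, 0.1)
y_pred = np.where((y_true == 1) & (rng.random(n) < miss_rate), 0, y_true)

for g in ("Black", "white"):
    truly_high_risk = (group == g) & (y_true == 1)
    rec = (y_pred[truly_high_risk] == 1).mean()
    print(f"recall for {g} patients: {rec:.2f}")  # noticeably lower for Black patients
```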

{snip}

One way to improve the situation of Black patients is therefore to improve the system’s recall. As a first step, we may decide to err on the side of caution and tell the system to change its predictions for the cases it is least confident about involving Black patients. Specifically, we would flip some low-confidence “low risk” cases to “high risk” in order to catch more cases of cancer. This is called “leveling up,” or designing systems to purposefully change some of their predictions for the groups currently disadvantaged by those systems {snip}
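
Mechanically, leveling up might look something like the following sketch. The function, its parameters, and the assumption that the model exposes a predicted risk score for each patient are all our own illustration.

```python
import numpy as np

def level_up(y_pred, scores, group, target_group, n_flips):
    """Flip the least-confident 'low risk' (0) predictions for `target_group`
    to 'high risk' (1). `scores` is the model's predicted probability of high
    risk, so 'low risk' calls with higher scores are the least confident ones."""
    y_new = y_pred.copy()
    candidates = np.where((group == target_group) & (y_pred == 0))[0]
    # Least confident 'low risk' calls = highest predicted risk among them.
    order = candidates[np.argsort(scores[candidates])[::-1]]
    y_new[order[:n_flips]] = 1
    return y_new
```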

This change comes at the cost of accuracy; the number of people falsely identified as being at risk of cancer increases, and the system’s overall accuracy declines. However, this trade-off between accuracy and recall is acceptable because failing to diagnose someone with cancer is so harmful.

By flipping cases to increase recall at the cost of accuracy, we can eventually reach a state where any further changes would come at an unacceptably high loss of accuracy. This is ultimately a subjective decision; there is no true “tipping point” between recall and accuracy. We have not necessarily brought performance (or recall) for Black patients up to the same level as white patients, but we have done as much as possible with the current system, data available, and other constraints to improve the situation of Black patients and reduce the performance gap.
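
One way to formalize that stopping rule, assuming we are post-processing predictions on a labeled validation set, is to keep flipping until the loss in overall accuracy exceeds a budget chosen in advance. The sketch below is illustrative only; the 2 percent budget is an arbitrary assumption, and choosing it is precisely the subjective decision described above.

```python
import numpy as np

def level_up_until(y_true, y_pred, scores, group, target_group,
                   max_accuracy_loss=0.02):
    """Flip the least-confident 'low risk' calls for `target_group` to
    'high risk', one at a time, until overall accuracy on this labeled
    validation set drops by more than `max_accuracy_loss`."""
    baseline = (y_pred == y_true).mean()
    y_new = y_pred.copy()
    candidates = np.where((group == target_group) & (y_pred == 0))[0]
    # Work through 'low risk' calls from highest to lowest predicted risk.
    order = candidates[np.argsort(scores[candidates])[::-1]]
    for idx in order:
        y_trial = y_new.copy()
        y_trial[idx] = 1
        if baseline - (y_trial == y_true).mean() > max_accuracy_loss:
            break  # any further flips cost more accuracy than we will accept
        y_new = y_trial
    return y_new
```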

This is where we face a dilemma, and where the narrow focus of modern fairness algorithms on achieving equal performance at all costs creates unintended but unavoidable problems. Though we cannot improve performance for Black patients any further without an unacceptable loss of accuracy, we could also reduce performance for white patients, lowering both their recall and accuracy in the process, so that our system has equal recall rates for both groups. In our example, we would alter the labels of white patients, switching some of the predictions from “high risk” to “low risk.”
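
A sketch of that “leveling down” step, again purely illustrative and using names we have invented, shows how mechanical the equalization is: keep downgrading the better-off group’s “high risk” calls until the two recall rates match.

```python
import numpy as np

def recall(y_true, y_pred, mask):
    """Recall restricted to the patients selected by `mask`."""
    pos = mask & (y_true == 1)
    return (y_pred[pos] == 1).mean()

def level_down(y_true, y_pred, scores, group, better_group, worse_group):
    """Equalize recall by flipping the better-off group's least-confident
    'high risk' (1) calls to 'low risk' (0) until its recall falls to the
    worse-off group's level. This is the step that withholds correct
    'high risk' labels from patients who need them."""
    y_new = y_pred.copy()
    target = recall(y_true, y_new, group == worse_group)
    candidates = np.where((group == better_group) & (y_new == 1))[0]
    order = candidates[np.argsort(scores[candidates])]  # least confident first
    for idx in order:
        if recall(y_true, y_new, group == better_group) <= target:
            break  # the recall gap is closed
        y_new[idx] = 0
    return y_new
```

Nothing in this procedure helps anyone; it only removes correct “high risk” predictions until the numbers line up.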

{snip}

Clearly, marking a formerly “high risk” patient as “low risk” is extremely harmful for patients who would not be offered follow-up care and monitoring. {snip}

{snip}

{snip} In practice, fairness algorithms may behave much more radically and unpredictably. One survey found that, on average, most algorithms in computer vision improved fairness by harming all groups, for example by decreasing recall and accuracy. Unlike in our hypothetical, where we decreased the harm suffered by one group, it is possible that leveling down can make everyone directly worse off.

{snip}