Incident 11: Northpointe Risk Models
CSET Taxonomy Classifications
An algorithm developed by Northpointe and used in the penal system is shown to be inaccurate and to produce racially skewed results, according to a review by ProPublica. The review shows how the 137-question survey given following an arrest is inaccurate and skewed against people of color. While there is no question regarding race in the survey, the algorithm is two times more likely to incorrectly label a black person as a high-risk re-offender (False Positive) and is also two times more likely to incorrectly label a white person as low-risk for reoffense (False Negative) than actual statistics support. Overall, the algorithm is 61% effective at predicting reoffense. This system is used in Broward County, Florida to help judges make decisions surrounding pre-trial release and sentencing post-trial.
An algorithm developed by Northpointe and used in the penal system is two times more likely to incorrectly label a black person as a high-risk re-offender and is two times more likely to incorrectly label a white person as low-risk for reoffense according to a ProPublica review.
Harm Distribution Basis
Harm to civil liberties, Other:Reputational harm; False incarceration
AI System Description
An algorithm developed by Northpointe to assign a risk score reflecting a person's likelihood of reoffending after their original arrest.
Sector of Deployment
Public administration and defence
Relevant AI functions
law enforcement algorithm, crime prediction algorithm
risk assessment, crime projection
Broward County, Florida
ProPublica, Northpointe, COMPAS, Broward County, FL
Computer programs that perform risk assessments of crime suspects are increasingly common in American courtrooms, and are used at every stage of the criminal justice system to determine who may be set free or granted parole, and the size of the bond they must pay. By 2016, the results of these assessments were given to judges during criminal sentencing and a sentencing reform bill was proposed in Congress to mandate the use of such assessments in federal prisons. In a study of the risk scores assigned to more than 7,000 people in Florida's Broward County in 2013 and 2014, ProPublica found that only 20% of the people the system predicted would commit violent crimes had actually done so. For the full range of crimes including misdemeanours, 61% of those predicted to re-offend were arrested for later crimes over the following two years.
ProPublica also found significant racial disparities. Although the algorithm made errors at roughly the same rate for black and white defendants, it incorrectly labelled black defendants as likely to commit further crimes at twice the rate of white defendants. Conversely, white defendants were mislabelled as low risk more often than black defendants. Northpointe, the company that produced the system, known as COMPAS, disputed ProPublica's analysis but declined to share its calculations, which the company said were proprietary. However, it did disclose that the basics of its formula included factors such as education levels and employment status among the 137 questions that are either answered by defendants or extracted from criminal records. These tools have been rolled out in many areas before they have been rigorously evaluated, and defendants are rarely able to find out the basis for the scores they're assigned.
Writer: Julia Angwin, Jeff Larson, Surya Mattu, Lauren Kirchner
Across the nation, judges, probation and parole officers are increasingly using algorithms to assess a criminal defendant’s likelihood of becoming a recidivist – a term used to describe criminals who re-offend. There are dozens of these risk assessment algorithms in use. Many states have built their own assessments, and several academics have written tools. There are also two leading nationwide tools offered by commercial vendors.
We set out to assess one of the commercial tools made by Northpointe, Inc. to discover the underlying accuracy of their recidivism algorithm and to test whether the algorithm was biased against certain groups.
Our analysis of Northpointe’s tool, called COMPAS (which stands for Correctional Offender Management Profiling for Alternative Sanctions), found that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk.
We looked at more than 10,000 criminal defendants in Broward County, Florida, and compared their predicted recidivism rates with the rate that actually occurred over a two-year period. When most defendants are booked in jail, they respond to a COMPAS questionnaire. Their answers are fed into the COMPAS software to generate several scores including predictions of “Risk of Recidivism” and “Risk of Violent Recidivism.”
We compared the recidivism risk categories predicted by the COMPAS tool to the actual recidivism rates of defendants in the two years after they were scored, and found that the score correctly predicted an offender’s recidivism 61 percent of the time, but was only correct in its predictions of violent recidivism 20 percent of the time.
In forecasting who would re-offend, the algorithm correctly predicted recidivism for black and white defendants at roughly the same rate (59 percent for white defendants, and 63 percent for black defendants) but made mistakes in very different ways. It misclassifies the white and black defendants differently when examined over a two-year follow-up period.
Our analysis found that:
Black defendants were often predicted to be at a higher risk of recidivism than they actually were. Our analysis found that black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts (45 percent vs. 23 percent).
White defendants were often predicted to be less risky than they were. Our analysis found that white defendants who re-offended within the next two years were mistakenly labeled low risk almost twice as often as black re-offenders (48 percent vs. 28 percent).
The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 45 percent more likely to be assigned higher risk scores than white defendants.
Black defendants were also twice as likely as white defendants to be misclassified as being a higher risk of violent recidivism. And white violent recidivists were 63 percent more likely to have been misclassified as a low risk of violent recidivism, compared with black violent recidivists.
The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 77 percent more likely to be assigned higher risk scores than white defendants.
In 2013, researchers Sarah Desmarais and Jay Singh examined 19 different recidivism risk methodologies being used in the United States and found that “in most cases, validity had only been examined in one or two studies conducted in the United States, and frequently, those investigations were completed by the same people who developed the instrument.”
Their analysis of the research published before March 2013 found that the tools “were moderate at best in terms of predictive validity,” Desmarais said in an interview. And she could not find any substantial set of studies conducted in the United States that examined whether risk scores were racially biased. “The data do not exist,” she said.
The largest examination of racial bias in U.S. risk assessment algorithms since then is a 2016 paper by Jennifer Skeem at University of California, Berkeley and Christopher T. Lowenkamp from the Administrative Office of the U.S. Courts. They examined data about 34,000 federal offenders to test the predictive validity of the Post Conviction Risk Assessment tool that was developed by the federal courts to help probation and parole officers determine the level of supervision required for an inmate upon release.
The authors found that the average risk score for black offenders was higher than for white offenders, but concluded that the differences were not attributable to bias.
A 2013 study analyzed the predictive validity among various races for another score called the Level of Service Inventory, one of the most popular commercial risk scores from Multi-Health Systems. That study found that “ethnic minorities have higher LS scores than nonminorities.” The study authors, who are Canadian, noted that racial disparities were more consistently found in the U.S. than in Canada. “One possibility may be that systematic bias within the justice system may distort the measurement of ‘true’ recidivism,” they wrote.
A smaller 2006 study of 532 male residents of a work-release program also found “a tendency toward classification errors for African Americans” in the Level of Service Inventory-Revised. The study, by Kevin Whiteacre of the Salvation Army Correctional Services Program, found that 42.7 percent of African Americans were incorrectly classified as high risk, compared with 27.7 percent of Caucasians and 25 percent of Hispanics. That study urged correctional facilities to investigate their use of the scores independently using a simple contingency table approach that we follow later in this study.
As risk scores move further into the mainstream of the criminal justice system, policy makers have called for further studies of whether the scores are biased.
When he was U.S. Attorney General, Eric Holder asked the U.S. Sentencing Commission to study potential bias in the tests used at sentencing. “Although these measures were crafted with the best of intentions, I am concerned that they inadvertently undermine our efforts to ensure individualized and equal justice,” he said, adding, “they may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system and in our society.” The sentencing commission says it is not currently conducting an analysis of bias in risk assessments.
So ProPublica did its own analysis.
How We Acquired the Data
We chose to examine the COMPAS algorithm because it is one of the most popular scores used nationwide and is increasingly being used in pretrial and sentencing, the so-called “front-end” of the criminal justice system. We chose Broward County because it is a large jurisdiction using the COMPAS tool in pretrial release decisions and Florida has strong open-records laws.
Through a public records request, ProPublica obtained two years’ worth of COMPAS scores from the Broward County Sheriff’s Office in Florida. We received data for all 18,610 people who were scored in 2013 and 2014.
Because Broward County primarily uses the score to determine whether to release or detain a defendant before his or her trial, we discarded scores that were assessed at parole, probation or other stages in the criminal justice system. That left us with 11,757 people who were assessed at the pretrial stage.
Each pretrial defendant received at least three COMPAS scores: “Risk of Recidivism,” “Risk of Violence” and “Risk of Failure to Appear.”
COMPAS scores for each defendant ranged from 1 to 10, with ten being the highest risk. Scores 1 to 4 were labeled by COMPAS as “Low”; 5 to 7 were labeled “Medium”; and 8 to 10 were labeled “High.”
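The banding described above can be written as a small lookup. This is a minimal sketch: the function names `risk_label` and `is_higher_risk` are hypothetical and are not part of COMPAS or of the original analysis code.

```python
def risk_label(decile_score: int) -> str:
    """Map a COMPAS decile score (1-10) to the band described in the text."""
    if not 1 <= decile_score <= 10:
        raise ValueError("decile score must be between 1 and 10")
    if decile_score <= 4:
        return "Low"
    if decile_score <= 7:
        return "Medium"
    return "High"


def is_higher_risk(decile_score: int) -> bool:
    """True for any band above Low (that is, Medium or High)."""
    return risk_label(decile_score) != "Low"
```

The second helper groups Medium and High together, which is the binary split this study uses in its regression and contingency-table analyses.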
Starting with the database of COMPAS scores, we built a profile of each person’s criminal history, both before and after they were scored. We collected public criminal records from the Broward County Clerk’s Office website through April 1, 2016. On average, defendants in our dataset were not incarcerated for 622.87 days (sd: 329.19).
We matched the criminal records to the COMPAS records using a person’s first and last names and date of birth. This is the same technique used in the Broward County COMPAS validation study conducted by researchers at Florida State University in 2010. We downloaded around 80,000 criminal records from the Broward County Clerk’s Office website.
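A matching step of this kind might look like the sketch below. The field names (`first`, `last`, `dob`) and the normalization (trimming whitespace, lower-casing) are assumptions made for illustration; the text does not say how names were normalized before matching.

```python
from datetime import date


def match_key(first: str, last: str, dob: date) -> tuple:
    """Build a join key from first name, last name and date of birth."""
    # Assumed normalization: ignore case and surrounding whitespace.
    return (first.strip().lower(), last.strip().lower(), dob)


def match_records(compas_rows, criminal_rows):
    """Pair each COMPAS row with every criminal record sharing its key."""
    by_key = {}
    for rec in criminal_rows:
        key = match_key(rec["first"], rec["last"], rec["dob"])
        by_key.setdefault(key, []).append(rec)
    return [
        (row, by_key.get(match_key(row["first"], row["last"], row["dob"]), []))
        for row in compas_rows
    ]
```

Because the key is an exact match on three fields, a typo in any one of them either drops a true match or creates a false one.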
To determine race, we used the race classifications used by the Broward County Sheriff’s Office, which identifies defendants as black, white, Hispanic, Asian and Native American. In 343 cases, the race was marked as Other.
We also compiled each person’s record of incarceration. We received jail records from the Broward County Sheriff’s Office from January 2013 to April 2016, and we downloaded public incarceration records from the Florida Department of Corrections website.
We found that sometimes people’s names or dates of birth were incorrectly entered in some records – which led to incorrect matches between an individual’s COMPAS score and his or her criminal records. We attempted to determine how many records were affected. In a random sample of 400 cases, we found an error rate of 3.75 percent (CI: +/- 1.8 percent).
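As a rough check on the sampling figures, the sketch below computes a 95 percent margin of error with the normal approximation. Two things are assumptions: the count of 15 errors is inferred from 3.75 percent of 400, and the text does not say which interval method was actually used.

```python
import math


def proportion_margin(errors: int, n: int, z: float = 1.96) -> tuple:
    """Return (observed rate, normal-approximation margin of error)."""
    p = errors / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# 15 mismatched records out of 400 sampled (inferred count, see above).
rate, margin = proportion_margin(errors=15, n=400)
print(f"error rate {rate:.2%}, margin +/- {margin:.2%}")
```

Under these assumptions the margin comes out just under 1.9 percentage points, close to the reported +/- 1.8 percent.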
How We Defined Recidivism
Defining recidivism was key to our analysis.
In a 2009 study examining the predictive power of its COMPAS score, Northpointe defined recidivism as “a finger-printable arrest involving a charge and a filing for any uniform crime reporting (UCR) code.” We interpreted that to mean a criminal offense that resulted in a jail booking and took place after the crime for which the person was COMPAS scored.
It was not always clear, however, which criminal case was associated with an individual’s COMPAS score. To match COMPAS scores with accompanying cases, we considered cases with arrest dates or charge dates within 30 days of a COMPAS assessment being conducted. In some instances, we could not find any corresponding charges to COMPAS scores. We removed those cases from our analysis.
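The 30-day rule can be sketched as a date filter. The dictionary keys (`arrest_date`, `charge_date`) are hypothetical. Treating "within 30 days" as 30 days on either side of the assessment is also an assumption, since the text does not say whether the window is one-sided.

```python
from datetime import date, timedelta


def cases_near_screening(screening_date, cases, window_days=30):
    """Keep cases whose arrest or charge date is within the window."""
    window = timedelta(days=window_days)
    matched = []
    for case in cases:
        # A case may be missing either date; use whichever are present.
        dates = [d for d in (case.get("arrest_date"), case.get("charge_date")) if d]
        if any(abs(d - screening_date) <= window for d in dates):
            matched.append(case)
    return matched
```

A COMPAS score for which this returns an empty list corresponds to the cases the analysis removed.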
Next, we sought to determine if a person had been charged with a new crime subsequent to the crime for which they were COMPAS screened. We did not count traffic tickets and some municipal ordinance violations as recidivism. We did not count as recidivists people who were arrested for failing to appear at their court hearings, or people who were later charged with a crime that occurred prior to their COMPAS screening.
For violent recidivism, we used the FBI’s definition of violent crime, a category that includes murder, manslaughter, forcible rape, robbery and aggravated assault.
For most of our analysis, we defined recidivism as a new arrest within two years. We based this decision on Northpointe’s practitioners guide, which says that its recidivism score is meant to predict “a new misdemeanor or felony offense within two years of the COMPAS administration date.”
In addition, a recent study of 25,000 federal prisoners’ recidivism rates by the U.S. Sentencing Commission shows that most recidivists commit a new crime within the first two years after release (if they are going to commit a crime at all).
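Taken together, the exclusions and the two-year window amount to a rule like the one sketched below. The `kind` labels are hypothetical, two years is approximated as 730 days, and every municipal ordinance violation is excluded even though the text says only some were. All three are simplifying assumptions.

```python
from datetime import date, timedelta

# Hypothetical category labels for charges that do not count as recidivism.
EXCLUDED_KINDS = {"traffic", "municipal_ordinance", "failure_to_appear"}
TWO_YEARS = timedelta(days=730)  # assumed length of the follow-up window


def is_two_year_recidivist(screening_date, later_charges):
    """True if a countable offense occurred within two years of screening."""
    for charge in later_charges:
        if charge["kind"] in EXCLUDED_KINDS:
            continue
        offense_date = charge["offense_date"]
        # Strictly after screening: crimes that predate the COMPAS
        # screening are not counted, even if they were charged later.
        if screening_date < offense_date <= screening_date + TWO_YEARS:
            return True
    return False
```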
We analyzed the COMPAS scores for “Risk of Recidivism” and “Risk of Violent Recidivism.” We did not analyze the COMPAS score for “Risk of Failure to Appear.”
We began by looking at the risk of recidivism score. Our initial analysis looked at the simple distribution of the COMPAS decile scores among whites and blacks. We plotted the distribution of these scores for 6,172 defendants who had not been arrested for a new offense or who had recidivated within two years.
These histograms show that scores for white defendants were skewed toward lower-risk categories, while black defendants were evenly distributed across scores. In our two-year sample, there were 3,175 black defendants and 2,103 white defendants, with 1,175 female defendants and 4,997 male defendants. There were 2,809 defendants who recidivated within two years in this sample.
The histograms for COMPAS’s violent risk score also show a disparity in score distribution between white and black defendants. The sample we used to test COMPAS’s violent recidivism score was slightly smaller than for the general recidivism score: 4,020 defendants, 1,918 black defendants and 1,459 white defendants. There were 652 violent recidivists.
While there is a clear difference between the distributions of COMPAS scores for white and black defendants, merely looking at the distributions does not account for other demographic and behavioral factors.
To test racial disparities in the score controlling for other factors, we created a logistic regression model that considered race, age, criminal history, future recidivism, charge degree and gender.
Risk of General Recidivism Logistic Model
Score (Low vs Medium and High)
Female 0.221*** (0.080)
Age: Greater than 45 -1.356*** (0.099)
Age: Less than 25 1.308*** (0.076)
Black 0.477*** (0.069)
Asian -0.254 (0.478)
Hispanic -0.428*** (0.128)
Native American 1.394* (0.766)
Other -0.826*** (0.162)
Number of Priors 0.269*** (0.011)
Misdemeanor -0.311*** (0.067)
Two year Recidivism 0.686*** (0.064)
Constant -1.526*** (0.079)
Akaike Inf. Crit. 6,192.402
Note: *p<0.1; **p<0.05; ***p<0.01
We used those factors to model the odds of getting a higher COMPAS score. According to Northpointe’s practitioners guide, COMPAS “scores in the medium and high range garner more interest from supervision agencies than low scores, as a low score would suggest there is little risk of general recidivism,” so we considered scores any higher than “low” to indicate a risk of recidivism.
Our logistic model found that the most predictive factor of a higher risk score was age. Defendants younger than 25 years old were 2.5 times as likely to get a higher score than middle aged offenders, even when controlling for prior crimes, future criminality, race and gender.
Race was also quite predictive of a higher score. While Black defendants had higher recidivism rates overall, when adjusted for this difference and other factors, they were 45 percent more likely to get a higher score than whites.
Surprisingly, given their lower levels of criminality overall, female defendants were 19.4 percent more likely to get a higher score than men, controlling for the same factors.
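The text does not spell out how the coefficients were turned into figures such as "45 percent more likely." One conversion that reproduces them is sketched below. It treats the intercept as the baseline probability for the reference defendant and rescales the odds ratio into a ratio of probabilities. This is an assumption about the method, and the function name is hypothetical.

```python
import math


def relative_likelihood(coef: float, intercept: float) -> float:
    """Ratio of predicted probabilities: group with `coef` vs. the baseline."""
    # Baseline probability of a higher score for the reference defendant.
    baseline = math.exp(intercept) / (1 + math.exp(intercept))
    odds_ratio = math.exp(coef)
    # Convert the odds ratio into a probability ratio at that baseline.
    return odds_ratio / (1 - baseline + baseline * odds_ratio)


INTERCEPT = -1.526  # "Constant" in the general recidivism table
black = relative_likelihood(0.477, INTERCEPT)
under_25 = relative_likelihood(1.308, INTERCEPT)
female = relative_likelihood(0.221, INTERCEPT)
print(round(black, 2), round(under_25, 2), round(female, 3))
```

With the rounded coefficients from the table, this gives roughly 1.45 for black defendants, 2.5 for defendants under 25 and 1.19 for women, in line with the figures quoted in the text.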
Risk of Violent Recidivism Logistic Model
Score (Low vs Medium and High)
Female -0.729*** (0.127)
Age: Greater than 45 -1.742*** (0.184)
Age: Less than 25 3.146*** (0.115)
Black 0.659*** (0.108)
Asian -0.985 (0.705)
Hispanic -0.064 (0.191)
Native American 0.448 (1.035)
Other -0.205 (0.225)
Number of Priors 0.138*** (0.012)
Misdemeanor -0.164* (0.098)
Two Year Recidivism 0.934*** (0.115)
Constant -2.243*** (0.113)
Akaike Inf. Crit. 3,022.779
Note: *p<0.1; **p<0.05; ***p<0.01
The COMPAS software also has a score for risk of violent recidivism. We analyzed 4,020 people who were scored for violent recidivism over a period of two years (not including time spent incarcerated). We ran a similar regression model for these scores.
Age was an even stronger predictor of a higher score for violent recidivism. Our regression showed that young defendants were 6.4 times more likely to get a higher score than middle age defendants, when correcting for criminal history, gender, race and future violent recidivism.
Race was also predictive of a higher score for violent recidivism. Black defendants were 77.3 percent more likely than white defendants to receive a higher score, correcting for criminal history and future violent recidivism.
To test COMPAS’s overall predictive accuracy, we fit a Cox proportional hazards model to the data – the same technique that Northpointe used in its own validation study. A Cox model allows us to compare rates of recidivism while controlling for time. Because we aren’t controlling for other factors such as a defendant’s criminality we can include more people in this Cox model. For this analysis our sample size was 10,314 defendants (3,569 white defendants and 5,147 black defendants).
Risk of General Recidivism Cox Model
High Risk 1.250*** (0.041)
Medium Risk 0.796*** (0.041)
Max. Possible R2 0.990
Wald Test 954.820*** (df = 2)
LR Test 942.824*** (df = 2)
Score (Logrank) Test 1,054.767*** (df = 2)
Note: *p<0.1; **p<0.05; ***p<0.01
We considered people in our data set to be “at risk” from the day they were given the COMPAS score until the day they committed a new offense or April 1, 2016, whichever came first. We removed people from the risk set while they were incarcerated. The independent variable in the Cox model was the COMPAS categorical risk score.
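The "at risk" period described above can be sketched as a day count. The function and argument names are hypothetical, and the sketch assumes the jail stays passed in do not overlap one another.

```python
from datetime import date

STUDY_END = date(2016, 4, 1)  # cutoff date stated in the text


def days_at_risk(score_date, new_offense_date, jail_stays, study_end=STUDY_END):
    """Days from scoring to first new offense (or cutoff), minus jail time."""
    end = min(new_offense_date, study_end) if new_offense_date else study_end
    total = (end - score_date).days
    for start, stop in jail_stays:
        # Only subtract the part of each stay inside the at-risk window.
        overlap_start = max(start, score_date)
        overlap_stop = min(stop, end)
        if overlap_stop > overlap_start:
            total -= (overlap_stop - overlap_start).days
    return max(total, 0)
```

For example, a defendant scored on January 1, 2014, who spent February 2014 in jail and had no new offense, would be at risk for 793 days under this sketch.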
The Cox model showed that people with high scores were 3.5 times as likely to recidivate as people in the low (scores 1 to 4) category. Northpointe’s study found that people with high scores (scores 8 to 10) were 5.6 times as likely to recidivate. Both results indicate that the score has predictive value.
A Kaplan Meier survival plot also shows a clear difference in recidivism rates between each COMPAS score level.
Overall, the Cox regression had a concordance score of 63.6 percent. That means for any randomly selected pair of defendants in the sample, the COMPAS system can accurately rank their recidivism risk 63.6 percent of the time (e.g. if one person of the pair recidivates, that pair will count as a successful match if that person also had a higher score). In its study, Northpointe reported a slightly higher concordance: 68 percent.
Running the Cox model on the underlying risk scores - ranked 1 to 10 - rather than the low, medium and high intervals yielded a slightly higher concordance of 66.4 percent.
Both results are lower than what Northpointe describes as a threshold for reliability. “A rule of thumb according to several recent articles is that AUCs of .70 or above typically indicate satisfactory predictive accuracy, and measures between .60 and .70 suggest low to moderate predictive accuracy,” the company says in its study.
The COMPAS violent recidivism score had a concordance of 65.1 percent.
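To make the idea of concordance concrete, here is a simplified pairwise calculation on invented data. It is a sketch of the general technique, not the routine used in the study. It skips pairs with tied follow-up times, and the four toy records are hypothetical.

```python
from itertools import combinations


def concordance(records):
    """records: list of (days_at_risk, recidivated, score) tuples."""
    concordant = 0.0
    comparable = 0
    for a, b in combinations(records, 2):
        # Order the pair so `first` has the shorter follow-up time.
        first, second = (a, b) if a[0] <= b[0] else (b, a)
        # A pair is comparable only if the shorter time ended in a new
        # offense; otherwise we cannot tell who would recidivate sooner.
        if not first[1] or first[0] == second[0]:
            continue
        comparable += 1
        if first[2] > second[2]:
            concordant += 1  # earlier recidivist had the higher score
        elif first[2] == second[2]:
            concordant += 0.5  # tied scores count as half
    return concordant / comparable if comparable else float("nan")


# Hypothetical toy data: (days at risk, recidivated?, decile score)
toy = [(100, True, 9), (300, True, 6), (500, False, 7), (700, False, 2)]
print(concordance(toy))
```

In the toy data, five pairs are comparable and the score ranks four of them correctly, so the concordance is 0.8.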
The COMPAS system unevenly predicts recidivism between genders. According to Kaplan-Meier estimates, women rated high risk recidivated at a 47.5 percent rate during two years after they were scored. But men rated high risk recidivated at a much higher rate – 61.2 percent – over the same time period. This means that a high-risk woman has a much lower risk of recidivating than a high-risk man, a fact that may be overlooked by law enforcement officials interpreting the score.
Northpointe does offer a custom test for women, but it is not in use in Broward County.
The predictive accuracy of the COMPAS recidivism score was consistent between races in our study – 62.5 percent for white defendants vs. 62.3 percent for black defendants. The authors of the Northpointe study found a small difference in the concordance scores by race: 69 percent for white defendants and 67 percent for black defendants.
Across every risk category, black defendants recidivated at higher rates.
Risk of General Recidivism Cox Model (with Interaction Term)
Black 0.279*** (0.061)
Asian -0.777 (0.502)
Hispanic -0.064 (0.097)
Native American -1.255 (1.001)
Other 0.014 (0.110)
High Score 1.284*** (0.084)
Medium Score 0.843*** (0.071)
Black:High -0.190* (.100, p: 0.0574)
Asian:High 1.316* (0.768)
Hispanic:High -0.119 (0.198)
Native American:High 1.956* (.083)
Other:High 0.415 (0.259)
Black:Medium -0.173* (.091, p: 0.0578)
Asian:Medium 0.986 (0.711)
Hispanic:Medium 0.065 (0.164)
Native American:Medium 1.390 (1.120)
Other:Medium -0.334 (0.232)
Max. Possible R2 0.990
Log Likelihood -30,280.410
Wald Test 988.830*** (df = 17)
LR Test 993.709*** (df = 17)
Score (Logrank) Test 1,104.894*** (df = 17)
Note: *p<0.1; **p<0.05; ***p<0.01
We also added a race-by-score interaction term to the Cox model. This term allowed us to consider whether the difference in recidivism between a high score and low score was different for black defendants and white defendants.
The coefficient on high scores for black defendants is almost statistically significant (0.0574). High-risk white defendants are 3.61 times as likely to recidivate as low-risk white defendants, while high-risk black defendants are only 2.99 times as likely to recidivate as low-risk black defendants. The hazard ratios for medium-risk defendants vs. low risk defendants also are different across races: 2.32 for white defendants and 1.95 for black defendants. Because of the gap in hazard ratios, we can conclude that the score is performing differently among racial subgroups.
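The hazard ratios quoted above follow from exponentiating the coefficients in the interaction table. The sketch below shows the arithmetic; the function name is hypothetical. Because each ratio compares high (or medium) with low scores within one race, the main effect for race cancels and only the score and interaction coefficients matter.

```python
import math

# Coefficients copied from the interaction-term Cox table.
HIGH, MEDIUM = 1.284, 0.843
BLACK_HIGH, BLACK_MEDIUM = -0.190, -0.173


def hazard_ratio(score_coef: float, interaction_coef: float = 0.0) -> float:
    """Hazard relative to low-score defendants of the same race."""
    return math.exp(score_coef + interaction_coef)


white_high = hazard_ratio(HIGH)
black_high = hazard_ratio(HIGH, BLACK_HIGH)
white_medium = hazard_ratio(MEDIUM)
black_medium = hazard_ratio(MEDIUM, BLACK_MEDIUM)
print(round(white_high, 2), round(black_high, 2),
      round(white_medium, 2), round(black_medium, 2))
```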
We ran a similar analysis on COMPAS’s violent recidivism score but did not find a similar result. Here, we found that the interaction term on race and score was not significant, meaning that there is no significant difference between the hazards of high and low risk black defendants and high and low risk white defendants.
Overall, there are far fewer violent recidivists than general recidivists and there isn’t a clear difference in the hazard rates across score levels for black and white recidivists. These Kaplan Meier plots show very low rates of violent recidivism.
Finally, we investigated whether certain types of errors – false positives and false negatives – were unevenly distributed among races. We used contingency tables to determine those relative rates following the analysis outlined in the 2006 paper from the Salvation Army.
We removed people from our data set for whom we had less than two years of recidivism information. The remaining population was 7,214 – slightly larger than the sample in the logistic models above, because we don’t need a defendant’s case information for this analysis. As in the logistic regression analysis, we marked scores other than “low” as higher risk. The following tables show how the COMPAS recidivism score performed:
All defendants
             Low   Medium/High
Survived     2681  1282
Recidivated  1216  2035
FP rate: 32.35
FN rate: 37.40

Black defendants
             Low   Medium/High
Survived     990   805
Recidivated  532   1369
FP rate: 44.85
FN rate: 27.99

White defendants
             Low   Medium/High
Survived     1139  349
Recidivated  461   505
FP rate: 23.45
FN rate: 47.72
These contingency tables reveal that the algorithm is more likely to misclassify a black defendant as higher risk than a white defendant. Black defendants who do not recidivate were nearly twice as likely to be classified by COMPAS as higher risk compared to their white counterparts (45 percent vs. 23 percent). However, black defendants who scored higher did recidivate slightly more often than white defendants (63 percent vs. 59 percent).
The test tended to make the opposite mistake with whites, meaning that it was more likely to wrongly predict that white defendants would not commit additional crimes if released than it was for black defendants. COMPAS under-classified white reoffenders as low risk 70.5 percent more often than black reoffenders (48 percent vs. 28 percent). The likelihood ratio for white defendants was slightly higher, 2.23, than for black defendants, 1.61.
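The rates quoted from the contingency tables can be recomputed from the four cell counts. The sketch below does so for the tables whose false positive rates match the 45 and 23 percent quoted for black and white defendants. The cell layout (rows are outcomes, columns are low then higher score) is inferred because it reproduces the printed FP and FN rates. Reading "likelihood ratio" as the positive likelihood ratio is likewise an assumption that happens to reproduce 2.23 and 1.61.

```python
def error_rates(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Summary rates for a 2x2 table of outcome vs. predicted risk."""
    fp_rate = fp / (fp + tn)  # non-recidivists scored as higher risk
    fn_rate = fn / (fn + tp)  # recidivists scored as low risk
    ppv = tp / (tp + fp)      # higher-risk defendants who did recidivate
    lr_plus = (1 - fn_rate) / fp_rate  # positive likelihood ratio
    return {"fp_rate": fp_rate, "fn_rate": fn_rate, "ppv": ppv, "lr_plus": lr_plus}


black = error_rates(tn=990, fp=805, fn=532, tp=1369)
white = error_rates(tn=1139, fp=349, fn=461, tp=505)

# How much more often white reoffenders were scored low risk.
fn_gap = white["fn_rate"] / black["fn_rate"] - 1
print(f"{fn_gap:.1%}")
```

Under these assumptions, the higher-score recidivism rates (`ppv`) also come out at about 63 percent for black defendants and 59 percent for white defendants, matching the figures in the text.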
We also tested whether restricting our definition of high risk to include only COMPAS’s high score, rather than including both medium and high scores, changed the results of our analysis. In that scenario, black defendants were three times as likely as white defendants to be falsely rated at high risk (16 percent vs. 5 percent).
We found similar results for the COMPAS violent recidivism score. As before, we calculated contingency tables based on how the score performed:
All defendants
             Low   Medium/High
Survived     4121  1597
Recidivated  347   389
FP rate: 27.93
FN rate: 47.15

Black defendants
             Low   Medium/High
Survived     1692  1043
Recidivated  170   273
FP rate: 38.14
FN rate: 38.37

White defendants
             Low   Medium/High
Survived     1679  380
Recidivated  129   77
FP rate: 18.46
FN rate: 62.62
Black defendants were twice as likely as white defendants to be misclassified as a higher risk of violent recidivism, and white recidivists were misclassified as low risk 63.2 percent more often than black defendants. Black defendants who were classified as a higher risk of violent recidivism did recidivate at a slightly higher rate than white defendants (21 percent vs. 17 percent), and the likelihood ratio for white defendants was higher, 2.03, than for black defendants, 1.62.
For years, the criminal justice community has been worried. Courts across the country are assigning bond amounts and sentencing the accused based on algorithms, and both lawyers and data scientists warn that these algorithms could be poisoned by the prejudices these systems were designed to escape.
Until now, that concern was pure speculation. Now, we know the truth.
An investigation published Monday morning by ProPublica analyzed the results of thousands of sentences handed out by algorithms, and found that these formulas are easier on white defendants, even when race is isolated as a factor.
"The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants," the investigative team wrote.
The algorithms don't take race directly into account, but instead use correlated information that can serve as a proxy for it. The Florida algorithm evaluated in the report is based on 137 questions, such as "Was one of your parents ever sent to jail or prison?" and "How many of your friends/acquaintances are taking drugs illegally?"
Those two questions, for example, can appear to evaluate someone's empirical risk of criminality, but instead, they target those already living under institutionalized poverty and over-policing. Predominantly, those people are people of color.
"[Punishment] profiling sends the toxic message that the state considers certain groups of people dangerous based on their identity," University of Michigan law professor Sonja Starr wrote in the New York Times in 2014. "It also confirms the widespread impression that the criminal justice system is rigged against the poor."
The algorithm itself, of course, was not available for audit. Algorithms that inform decisions in the public sector are often developed and protected by private companies. Northpointe, the for-profit company that created the algorithm examined by ProPublica, told ProPublica that it does not agree with the results of the analysis or that they "accurately reflect the outcomes" of its product.
[Image caption: White defendants were routinely given lower threat scores than black defendants. Credit: ProPublica]
But the controversy over sentencing is just one early instance of a growing conversation about bias in the algorithms that decide everything from what news we see to how and where we travel.
It's time to talk about algorithms: Algorithms seem impervious to the insidious influence of racism and prejudice, human innovations that can unconsciously creep into our fallible decision-making processes. Evaluations that come from algorithms imply that the results are scientific, spat out by a cold computer working only with evidence. The process of sentencing by algorithms is even formally referred to as "evidence-based sentencing."
"Scores give us simplistic ways of thinking that are very hard to resist," Cathy O'Neil, a data scientist and author of the upcoming book Weapons of Math Destruction, said by phone. "If you assign people scores and someone has a low score, it's human nature to assign blame to that person, even if that score just means they were born in a poor neighborhood."
But just because algorithms are mathematical in nature, doesn't mean they're free from human bias. Algorithms spot and amplify patterns in human behavior, and they do it by looking at the data created by human behavior. Predictive policing algorithms that help police chiefs assign their patrols rely on crime statistics and records generated by police behavior, eventually amplifying the prejudicial behaviors that led to that data in the first place.
As more news emerges of bias in algorithms, whether it's the potential anti-conservative bias of Facebook's news algorithm or pricing schemes that charge Asian communities more for SAT tutoring, the world is further disabused of the idea that algorithms can't be as skewed as human reasoning.
Often, they are skewed in precisely the same way we are.
On a spring afternoon in 2014, Brisha Borden was running late to pick up her god-sister from school when she spotted an unlocked kid’s blue Huffy bicycle and a silver Razor scooter. Borden and a friend grabbed the bike and scooter and tried to ride them down the street in the Fort Lauderdale suburb of Coral Springs.
Just as the 18-year-old girls were realizing they were too big for the tiny conveyances — which belonged to a 6-year-old boy — a woman came running after them saying, “That’s my kid’s stuff.” Borden and her friend immediately dropped the bike and scooter and walked away.
But it was too late — a neighbor who witnessed the heist had already called the police. Borden and her friend were arrested and charged with burglary and petty theft for the items, which were valued at a total of $80.
Compare their crime with a similar one: The previous summer, 41-year-old Vernon Prater was picked up for shoplifting $86.35 worth of tools from a nearby Home Depot store.
Prater was the more seasoned criminal. He had already been convicted of armed robbery and attempted armed robbery, for which he served five years in prison, in addition to another armed robbery charge. Borden had a record, too, but it was for misdemeanors committed when she was a juvenile.
Yet something odd happened when Borden and Prater were booked into jail: A computer program spat out a score predicting the likelihood of each committing a future crime. Borden — who is black — was rated a high risk. Prater — who is white — was rated a low risk.
Two years later, we know the computer algorithm got it exactly backward. Borden has not been charged with any new crimes. Prater is serving an eight-year prison term for subsequently breaking into a warehouse and stealing thousands of dollars’ worth of electronics.
Scores like this — known as risk assessments — are increasingly common in courtrooms across the nation. They are used to inform decisions about who can be set free at every stage of the criminal justice system, from assigning bond amounts — as is the case in Fort Lauderdale — to even more fundamental decisions about defendants’ freedom. In Arizona, Colorado, Delaware, Kentucky, Louisiana, Oklahoma, Virginia, Washington and Wisconsin, the results of such assessments are given to judges during criminal sentencing.
Rating a defendant’s risk of future crime is often done in conjunction with an evaluation of a defendant’s rehabilitation needs. The Justice Department’s National Institute of Corrections now encourages the use of such combined assessments at every stage of the criminal justice process. And a landmark sentencing reform bill currently pending in Congress would mandate the use of such assessments in federal prisons.
Two Petty Theft Arrests
Vernon Prater. Prior offenses: 2 armed robberies, 1 attempted armed robbery. Risk score: 3 (low risk). Subsequent offenses: 1 grand theft.
Brisha Borden. Prior offenses: 4 juvenile misdemeanors. Risk score: 8 (high risk). Subsequent offenses: none.
Borden was rated high risk for future crime after she and a friend took a kid’s bike and scooter that were sitting outside. She did not reoffend.
In 2014, then U.S. Attorney General Eric Holder warned that the risk scores might be injecting bias into the courts. He called for the U.S. Sentencing Commission to study their use. “Although these measures were crafted with the best of intentions, I am concerned that they inadvertently undermine our efforts to ensure individualized and equal justice,” he said, adding, “they may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system and in our society.”
The sentencing commission did not, however, launch a study of risk scores. So ProPublica did, as part of a larger examination of the powerful, largely hidden effect of algorithms in American life.
We obtained the risk scores assigned to more than 7,000 people arrested in Broward County, Florida, in 2013 and 2014 and checked to see how many were charged with new crimes over the next two years, the same benchmark used by the creators of the algorithm.
The score proved remarkably unreliable in forecasting violent crime: Only 20 percent of the people predicted to commit violent crimes actually went on to do so.
When a full range of crimes was taken into account, including misdemeanors such as driving with an expired license, the algorithm was somewhat more accurate than a coin flip. Of those deemed likely to re-offend, 61 percent were arrested for any subsequent crimes within two years.
We also turned up significant racial disparities, just as Holder feared. In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.
The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
White defendants were mislabeled as low risk more often than black defendants.
The Hidden Discrimination In Criminal Risk-Assessment Scores
Courtrooms across the country are increasingly using a defendant's "risk assessment score" to help make decisions about bond, parole and sentencing. The companies behind these scores say they help predict whether a defendant will commit more crimes in the future. NPR's Kelly McEvers talks with Julia Angwin of ProPublica about a new investigation into risk assessment scores.
KELLY MCEVERS, HOST:
Imagine if you could use an algorithm to predict trouble and maybe avoid it. The idea has gained traction throughout the criminal justice system from police departments to courtrooms. It's called criminal risk assessment.
AUDIE CORNISH, HOST:
The nonprofit news outlet ProPublica looked at one of the most widely used risk-assessment programs and how it fared in Broward County, Fla. There people who have been arrested are given a questionnaire and then a score from 1 to 10.
MCEVERS: Four or higher suggests they are likely to reoffend. Reporter Julia Angwin says the analysis was accurate about 61 percent of the time, and it treated blacks and whites differently. I asked her to describe how the algorithm works.
JULIA ANGWIN: The one that we were looking at is put together by a proprietary software company named Northpointe. They don't disclose their exact formula, but they told us sort of generally what went into, which is essentially your criminal history, your education level, whether you have a job. And then there's a whole bunch of questions that have to do with your criminal thinking. So, for instance, do you agree or disagree that it's OK for a hungry person to steal? And not every assessment asks that. But the core product that Northpointe offers is 137 questions long. And then jurisdictions can use any subset.
MCEVERS: Other questions, like, you know, were your parents ever sent to jail or prison? How many of your friends are taking drugs illegally? How often did you - do you get in fights while at school? Stuff like that, right?
ANGWIN: Yeah. there are questions are a lot about your family, your attitudes. Do you feel bored often? Do you have anger issues? Did your family - ever been arrested?
MCEVERS: Does the questionnaire ask specifically about your race?
ANGWIN: No, it does not ask about your race.
MCEVERS: You talked to some hardened criminals who were rated as relatively low risk. Even they were surprised to be rated this way. How did that happen?
ANGWIN: This one guy that we wrote about, Jimmy Rivelli (ph), he is in his 50s, white guy, who, by his own account, has led a life of crime, mostly petty thefts and mostly to fuel his drug addiction, which he's struggling to overcome. But when I told him he was rated low risk, his reaction was that's a very big surprise to me because I had just got out of five years in state prison for drug trafficking when I was arrested for that.
MCEVERS: Wow. And so how is it someone like that can be given such a low rating?
ANGWIN: So we analyzed more than 7,000 scores to see what was causing these types of disparities. And what we found was that although this algorithm is actually OK at predicting general whether you're going to commit another crime within the next two years, it's actually inaccurate in this way where it fails differently for blacks and whites. So black defendants are twice as likely to be rated high risk incorrectly, meaning they did not go on to reoffend. And white defendants are twice as likely to be rated incorrectly as low risk and yet go on to reoffend.
MCEVERS: What does it mean for someone who is incorrectly tagged by this risk-assessment system?
ANGWIN: So basically, if you're given a low-risk score, this is given to the judge. And in Florida, where we were looking at the scores, the judge looks at that while making a decision about pretrial release, meaning can you get out of jail on bond while awaiting trial for your crime? In other jurisdictions where the same exact software is used, this score is used for sentencing.
So when you've been convicted of a crime, the judge gets a secret report, a presentencing investigation, usually sealed to the public, which says this person is high risk. You should take that into consideration when making your sentencing. And those decisions are very important because if you are judging - you're looking at this high-risk thing, you might be inclined - and this has happened - to be more likely to put that person in a longer prison sentence.
MCEVERS: We should say that the company, Northpointe, that administers these tests disputes your findings.
ANGWIN: Yes, that is correct.
MCEVERS: Based on all the evidence that you looked at and all the experts that you talked to, do you think there's a place for these types of algorithms in the criminal justice system?
ANGWIN: The movement towards risk-assessment scores is very well-intentioned, and I'm sympathetic with the idea that we need to make the criminal justice system more objective and more fair. And it hasn
One of my most treasured possessions is The Art of Computer Programming by Donald Knuth, a computer scientist for whom the word “legendary” might have been coined. In a way, one could think of his magnum opus as an attempt to do for computer science what Russell’s and Whitehead’s Principia Mathematica did for mathematics – ie to get back to the basics of the field and check out its foundational elements.
In computer science, one of those foundational elements is the algorithm – a self-contained, step-by-step set of operations to be performed, usually by a computer. Algorithms are a bit like the recipes we use in cooking, but they need to be much more precise because they have to be implemented by stupid, literal-thinking devices called computers.
Algorithms are the building blocks of all computer programs and Knuth’s masterpiece is devoted to their analysis. Are they finite (ie terminating after a finite number of steps)? Is each step precisely defined? What are its input and output? And is the algorithm effective? In the broad sweep of his magisterial inquiry, however, there is one question that Knuth never asks of an algorithm: what are its ethical implications?
That’s not a criticism, by the way. Such questions weren’t relevant to his project, which was to get computer science on to a solid foundation. Besides, he was writing in the 1960s when the idea that computers might have profound social, economic and political impacts was not on anybody’s radar. The thought that we would one day live in an “information society” that was comprehensively dependent on computers would have seemed fanciful to most people.
But that society has come to pass, and suddenly the algorithms that are the building blocks of this world have taken on a new significance because they have begun to acquire power over our everyday lives. They determine whether we can get a bank loan or a mortgage, and on what terms, for example; whether our names go on no-fly lists; and whether the local cops regard one as a potential criminal or not.
To take just one example, judges, police forces and parole officers across the US are now using a computer program to decide whether a criminal defendant is likely to reoffend or not. The basic idea is that an algorithm is likely to be more “objective” and consistent than the more subjective judgment of human officials. The algorithm in question is called Compas (Correctional Offender Management Profiling for Alternative Sanctions). When defendants are booked into jail, they respond to a Compas questionnaire and their answers are fed into the software to generate predictions of “risk of recidivism” and “risk of violent recidivism”.
It turns out that the algorithm is fairly good at predicting recidivism and less good at predicting the violent variety. So far, so good. But guess what? The algorithm is not colour blind. Black defendants who did not reoffend over a two-year period were nearly twice as likely to be misclassified as higher risk compared with their white counterparts; white defendants who reoffended within the next two years had been mistakenly labelled low risk almost twice as often as black reoffenders.
Donald Knuth (left), who tackled algorithms in the 1960s in The Art of Computer Programming. Photograph: Bettmann Archive
We know this only because the ProPublica website undertook a remarkable piece of investigative reporting. Via a freedom of information request, the journalists obtained the Compas scores of nearly 12,000 offenders in Florida and then built a profile of each individual’s criminal history both before and after they were scored. The results of the analysis are pretty clear. If you’re black, the chances of being judged a potential reoffender are significantly higher than if you’re white. And yet those algorithmic predictions are not borne out by evidence.
A cynic might say that this is no surprise: racism runs through the US justice system like the message in a stick of rock. One in three black men can expect to be incarcerated in his lifetime (compared with one in six Latinos and one in 17 whites). That should be an argument for doing assessments and predictions using an algorithm rather than officials who may be prejudiced. And yet this analysis of the Compas system suggests that even the machine has a racial bias.
The big puzzle is how the bias creeps into the algorithm. We might be able to understand how if we could examine it. But most of these algorithms are proprietary and secret, so they are effectively “black boxes” – virtual machines whose workings are opaque. Yet the software inside them was written by human beings, most of whom were probably unaware that their work now has an important moral dimension. Perhaps Professor Knuth’s next book should be The Ethics of Computer Programming.
Imagine you were found guilty of a crime and were waiting to learn your sentence. Would you rather have your sentence determined by a computer algorithm, which dispassionately weights factors that predict your future risk of crime (such as age or past arrests) or by the subjective assessment of a judge? And would that choice change if you were a different race?
Technology is often held up as a way to reduce racial disparities in the criminal justice system. If existing disparities are due at least in part to the racial biases of witnesses, police, and judges, then replacing some human judgments with computer algorithms that estimate crime risk could produce a fairer system. But might those algorithms also exhibit racial bias? This is a good question, but not one that’s easy to answer using existing data.
A ProPublica story in May claimed that the risk scores calculated by the private firm Northpointe are plagued by racial bias, systematically giving higher risk scores to blacks than to otherwise similar whites. If true, this is an important problem, as courts all over the country use risk scores to determine bail, sentencing, parole, and more. This study has received considerable media attention and was even cited by the Wisconsin Supreme Court in a legal decision limiting the use of risk assessments. Yet it contains several important errors of reasoning which call the conclusions into doubt.
The ProPublica study cites a disparity in “false-positive rates” as evidence of racial bias: black defendants who did not reoffend were more likely to have been classified as high risk than white defendants. As noted in Northpointe’s response, and explained in a recent column by Robert Verbruggen, these statistics are dangerously misleading. Any group that has higher recidivism rates (and therefore higher risk scores, on average) will mechanically have a higher false positive rate, even if the risk score is completely unbiased.
(The basic intuition is this: the false positive rate is the number of people labeled high risk who don’t reoffend divided by the total number of people who do not reoffend. In a group with high recidivism rates, the numerator will be larger because the pool of people labeled high risk is bigger and the denominator will be smaller because there are fewer people who do not reoffend. The result is that the ratio of these numbers is always larger than it is for low-recidivism groups.)
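The arithmetic behind this intuition can be sketched in a few lines of Python. This is a hypothetical illustration only: the two reoffense probabilities and the two group compositions are invented, not taken from the Broward County data, and the score is assumed to mean exactly the same thing in both groups.

```python
# Hypothetical illustration: every number below is invented, not COMPAS data.
P_REOFFEND_HIGH = 0.6  # assumed reoffense probability for anyone labeled high risk
P_REOFFEND_LOW = 0.2   # assumed reoffense probability for anyone labeled low risk


def base_rate(share_high):
    """Overall reoffense rate of a group in which `share_high` is labeled high risk."""
    return share_high * P_REOFFEND_HIGH + (1 - share_high) * P_REOFFEND_LOW


def false_positive_rate(share_high):
    """Share of the group's non-reoffenders who were labeled high risk."""
    high_no_reoffend = share_high * (1 - P_REOFFEND_HIGH)      # the numerator
    low_no_reoffend = (1 - share_high) * (1 - P_REOFFEND_LOW)
    # The denominator is everyone in the group who does not reoffend.
    return high_no_reoffend / (high_no_reoffend + low_no_reoffend)


# Two assumed groups that differ only in how many members are labeled high risk.
for name, share_high in [("lower-recidivism group", 0.3),
                         ("higher-recidivism group", 0.6)]:
    print(f"{name}: base rate {base_rate(share_high):.2f}, "
          f"false positive rate {false_positive_rate(share_high):.2f}")
```

Under these assumed numbers the false positive rate comes out near 0.18 for the first group and near 0.43 for the second, even though a high-risk label carries the same 60 percent reoffense probability in both.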
It seems counterintuitive, but disparities in false positives don’t tell us anything about racial disparities in the algorithm. Disparate false-positive rates will be present every time there are disparate rates of reoffending, regardless of racial bias and regardless of whether the risk score is made by a computer algorithm or by the subjective assessment of a judge.
However, even if we look at the correct set of numbers, we face a bigger problem: risk scores influence sentencing, and sentencing influences recidivism. Consider a defendant who is ordered by the courts to undergo substance abuse counseling due to his high risk score. If he doesn’t reoffend, is this because the risk score was wrong, or because the substance abuse counseling was effective? Consider a second defendant who received a prison sentence as a result of her high risk score. If she doesn’t reoffend, is this because the risk score was wrong, or because she was in prison until she was too old for crime? Recidivism rates do not tell us what a person’s propensity to commit another crime was at the time the risk score was calculated, and therefore they have limited use in determining the accuracy of those scores. A very nice paper by Shawn Bushway and Jeffrey Smith makes this point at length.
While methodological issues call ProPublica’s conclusions into question, potential racial bias in risk assessment remains an important issue. Close to 20 states are using risk assessment to help determine sentencing in at least some jurisdictions, and risk assessments in bail and parole are even more common. Considering that many of the inputs to risk assessments, such as past arrests, are subject to racially disparate policing practices, it would not be surprising if risk scores carried some of this bias over. These are complicated issues, and scholars such as Richard Berk and Sandra Mayson provide deeper analysis on how we should think of fairness and justice in risk assessment. But one of the most important policy questions is simple: do risk assessments increase or decrease racial disparities compared to the subjective decisions of judges?
An ideal approach to answering this question would be an experiment in which some judges are randomly assigned to use risk assessments as part of their decisions (their defendants are the treatment group), and some judges to operate as before (their defendants are the control group). We could then compare racial discrepancies in sentencing for the treatment group and the control group, to determine the effect of incorporating ri
Like a more crooked version of the Voight-Kampff test from Blade Runner, a new machine learning paper from a pair of Chinese researchers has delved into the controversial task of letting a computer decide on your innocence. Can a computer know if you're a criminal just from your face?
In their paper 'Automated Inference on Criminality using Face Images', published on the arXiv pre-print server, Xiaolin Wu and Xi Zhang from China's Shanghai Jiao Tong University investigate whether a computer can detect if a human could be a convicted criminal just by analysing his or her facial features. The two say their tests were successful, and that they even found a new law governing "the normality for faces of non-criminals."
They described the idea of algorithms that can match and exceed a human's performance in face recognition to infer criminality "irresistible". But as a number of Twitter users and commenters on Hacker News point out, by stuffing biases into artificial intelligence and machine learning algorithms, the computer could act on those biases. The researchers maintain that the data sets were controlled for race, gender, age, and facial expressions, though.
"Imagine this with drones, every CCTV camera in every city, the eyes of self driving cars, everywhere there's a camera…" Tim Maughan, November 18, 2016
The images used in the research were standard ID photographs of Chinese males between the ages of 18 and 55, with no facial hair, scars, or other markings. Wu and Zhang stress that the ID photos used were not police mugshots, and that out of 730 criminals, 235 committed violent crimes "including murder, rape, assault, kidnap, and robbery."
The two state they purposely took "any subtle human factors" out of the assessment process. As long as data sets are finely controlled, could human bias be completely eradicated? Wu told Motherboard that human bias didn't come into it. "In fact, we got our first batch of results a year ago. We went through very rigorous checking of our data sets, and also ran many tests searching for counterexamples but failed to find any," said Wu.
Here's how it worked: Xiaolin and Xi fed into a machine learning algorithm facial images of 1,856 people, of which half were convicted criminals, and then observed if any of their four classifiers—each using a different method of analysing facial features—could infer criminality.
They found that all four of their different classifiers were mostly successful, and that the faces of criminals and those not convicted of crimes differ in key ways that are perceptible to a computer program. Moreover, "the variation among criminal faces is significantly greater than that of the non-criminal faces," Xiaolin and Xi write.
"All four classifiers perform consistently well and produce evidence for the validity of automated face-induced inference on criminality, despite the historical controversy surrounding the topic," the researchers write. "Also, we find some discriminating structural features for predicting criminality, such as lip curvature, eye inner corner distance, and the so-called nose-mouth angle." The best classifier, known as the Convolutional Neural Network, achieved 89.51 percent accuracy in the tests.
"By extensive experiments and vigorous cross validations," the researchers conclude, "we have demonstrated that via supervised machine learning, data-driven face classifiers are able to make reliable inference on criminality."
While Xiaolin and Xi admit in their paper that they are "not qualified to discuss or to debate on societal stereotypes," the problem is that machine learning is adept at picking up on human biases in data sets and acting on those biases, as proved by multiple recent incidents. The pair admit they're on shaky ground. "We have been accused on Internet of being irresponsible socially," Wu said.
"This paper is the exact reason why we need to think about ethics in AI." Stephen Mayhew, November 17, 2016
In the paper they go on to quote philosopher Aristotle, "It is possible to infer character from features," but that has to be left to human psychologists, not machines, surely? One major concern going forward is that of false positives—that is, identifying innocent people as guilty—especially if this program is used in any sort of real-world criminal justice settings. The researchers said the algorithms did throw up some false positives (identifying non-criminals as criminals) and false negatives (identifying criminals as non-criminals), which increased when the faces were randomly labeled for control tests.
Online critics have lambasted the paper. "I thought this was a joke when I read the abstract, but it appears to be a genuine paper," said a user on Hacker News. "I agree it's an entirely valid area of study…but to do it you need experts in criminology, physiology and machine learning, not just a couple of people who can follow the Keras instructions for how to use a neural net for classification."
Others questioned the validity of the paper, noting that one of the researchers is listed as having a Gmail account. "First of all, I don't think this is satire. I'll admit that the use of a gmail account by a researcher at a Chinese uni is facially suspicious," posed another Hacker News reader.
Wu had an answer for this, however. "Some questioned why I used gmail address as a faculty member in China. In fact, I am also a professor at McMaster University, Canada," he told Motherboard.
Predicting the future is not only the province of fortune tellers or media pundits. Predictive algorithms, based on extensive datasets and statistics, have overtaken wholesale and retail operations, as any online shopper knows. And in the last few years, algorithms have been used to automate decision making for bank loans, school admissions, hiring and, infamously, predicting recidivism – the probability that a defendant will commit another crime in the next two years.
COMPAS, which stands for Correctional Offender Management Profiling for Alternative Sanctions, is such a program and was singled out by ProPublica earlier this year as being racially biased. COMPAS utilizes 137 variables in its proprietary and unpublished scoring algorithm; race is not one of those variables. ProPublica used a dataset of defendants in Broward County, Florida. The data included demographics, criminal history, a COMPAS score and the criminal actions in the subsequent two years. ProPublica then crosslinked this data with the defendants’ race. Their findings are generally accepted by all sides:
COMPAS is moderately accurate, correctly identifying recidivism among both white and black defendants about 60% of the time.
COMPAS’s errors reflect apparent racial bias. Blacks are more often wrongly identified as recidivist risks (statistically a false positive) and whites more often erroneously identified as not being a risk (a false negative).
The “penalty” for being misclassified as a higher risk is more likely to be stiffer punishment. Being misclassified as a lower risk is like a “Get out of jail” card.
As you might anticipate, significant controversy followed, couched mostly in an academic fight over which statistical measures or calculations more accurately depicted the bias. A study in Science Advances revisits the discussion and comes to a different conclusion.
In the current study, assessment by humans was compared to that of COMPAS using that Broward County dataset. The humans were found on Amazon’s Mechanical Turk, paid a dollar to participate and a $5 bonus if they were accurate more than 60% of the time.
Humans were just as accurate as the algorithm (62 vs. 65%)
The errors by the algorithm and humans were identical, overpredicting (false positives) recidivism for black defendants and underpredicting (false negatives) for white defendants.
Measure            Humans, Black %   Humans, White %   COMPAS, Black %   COMPAS, White %
Accuracy*          68.2              67.6              64.9              65.7
False Positives    37.1              27.2              40.4              25.4
False Negatives    29.2              40.3              30.9              47.9
* Accuracy is the share of all predictions that are true positives or true negatives.
The human assessors used only seven variables, not the 137 of COMPAS. This suggests that the algorithm was needlessly complex, at least in deciding recidivism risk. In fact, the researchers found that a model using just two variables, defendant age and the number of prior convictions, was as accurate as COMPAS’s predictions.
Of more significant interest is the finding that when human assessors were given the additional information regarding defendant race it had no impact. They were just as accurate and demonstrated the same racial disparity in false positives and negatives. Race was an associated confounder, but it was not the cause of the statistical difference. ProPublica’s narrative of racial bias was incorrect.
Algorithms are statistical models involving choices. If you optimize to find all the true positives, your false positives will increase. Lower your false positive rate and the false negatives increase. Do we want to incarcerate more or less? The MIT Technology Review puts it this way.
“Are we primarily interested in taking as few chances as possible that someone will skip bail or re-offend? What trade-offs should we make to ensure justice and lower the massive social costs of imprisonment?”
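The trade-off between the two kinds of error can be sketched by moving the cutoff on a 1 to 10 score. The ten scores and outcomes below are invented for illustration and are not drawn from any COMPAS data; the function name is likewise hypothetical.

```python
def error_counts(scores, reoffended, threshold):
    """Count false positives and false negatives when scores >= threshold are flagged."""
    false_pos = sum(1 for s, r in zip(scores, reoffended) if s >= threshold and not r)
    false_neg = sum(1 for s, r in zip(scores, reoffended) if s < threshold and r)
    return false_pos, false_neg


# Invented toy data: one hypothetical defendant per score from 1 to 10.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
reoffended = [False, False, True, False, False, True, False, True, True, True]

# Raising the cutoff flags fewer people: false positives can only fall,
# while false negatives can only rise.
for threshold in (4, 6, 8):
    fp, fn = error_counts(scores, reoffended, threshold)
    print(f"cutoff {threshold}: {fp} false positives, {fn} false negatives")
```

Which cutoff is "right" is not something the arithmetic can settle; it depends on how the two kinds of mistake are weighed against each other.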
COMPAS is meant to serve as a decision aid. The purpose of the 137 variables is to create a variety of scales depicting substance abuse, environment, criminal opportunity, associates, etc. Its role is to assist the humans of our justice system in determining an appropriate punishment.  None of the studies, to my knowledge, looked at the sentences handed down. The current research ends as follows:
“When considering using software such as COMPAS in making decisions that will significantly affect the lives and well-being of criminal defendants, it is valuable to ask whether we would put these decisions in the hands of random people who respond to an online survey because, in the end, the results from these two approaches appear to be indistinguishable.”
The answer is no. COMPAS and similar algorithms are tools, not a replacement for human judgment. They facilitate but do not automate. ProPublica is correct when saying algorithmic decisions need to be understood by their human users and require continuous validation and refinement. But ProPublica’s narrative, that evil forces were responsible for a racially biased algorithm, is not true.
 COMPAS is scored on a 1 to 10 scale, with scores greate
Caution is indeed warranted, according to Julia Dressel and Hany Farid from Dartmouth College. In a new study, they have shown that COMPAS is no better at predicting an individual’s risk of recidivism than random volunteers recruited from the internet.
“Imagine you’re a judge and your court has purchased this software; the people behind it say they have big data and algorithms, and their software says the defendant is high-risk,” says Farid. “Now imagine I said: Hey, I asked 20 random people online if this person will recidivate and they said yes. How would you weight those two pieces of data? I bet you’d weight them differently. But what we’ve shown should give the courts some pause.” (A spokesperson from Equivant declined a request for an interview.)
COMPAS has attracted controversy before. In 2016, the technology reporter Julia Angwin and colleagues at ProPublica analyzed COMPAS assessments for more than 7,000 arrestees in Broward County, Florida, and published an investigation claiming that the algorithm was biased against African Americans. The problems, they said, lay in the algorithm’s mistakes. “Blacks are almost twice as likely as whites to be labeled a higher risk but not actually re-offend,” the team wrote. And COMPAS “makes the opposite mistake among whites: They are much more likely than blacks to be labeled lower-risk but go on to commit other crimes.”
Northpointe questioned ProPublica’s analysis, as did various academics. They noted, among other rebuttals, that the program correctly predicted recidivism in both white and black defendants at similar rates. For any given score on COMPAS’s 10-point scale, white and black people are just as likely to re-offend as each other. Others have noted that this debate hinges on one’s definition of fairness, and that it’s mathematically impossible to satisfy the standards set by both Northpointe and ProPublica—a story at The Washington Post clearly explains why.
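One way to see why both standards cannot hold at once is an identity that follows directly from the definitions of the error rates. The notation is introduced here for illustration and does not come from either side's analysis: for a given group, let p be the share who reoffend, PPV the share of those labeled high risk who go on to reoffend, FPR the false positive rate and FNR the false negative rate.

```latex
% Identity from the confusion-matrix definitions (notation assumed for this sketch):
% p = reoffense rate, PPV = P(reoffend | labeled high risk),
% FPR = P(labeled high risk | no reoffense), FNR = P(labeled low risk | reoffense).
\[
  \mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
\]
```

If two groups have the same PPV (roughly the standard Northpointe points to) but different reoffense rates p, the right-hand side differs between them unless the FNR differs too, so the false positive and false negative rates (the standard ProPublica points to) cannot both be equal across the groups, except in the degenerate case of a perfect predictor.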
The debate continues, but when Dressel read about it, she realized that it masked a different problem. “There was this underlying assumption in the conversation that the algorithm’s predictions were inherently better than human ones,” she says, “but I couldn’t find any research proving that.” So she and Farid did their own.
They recruited 400 volunteers through a crowdsourcing site. Each person saw short descriptions of defendants from ProPublica’s investigation, highlighting seven pieces of information. Based on that, they had to guess if the defendant would commit another crime within two years.
On average, they got the right answer 63 percent of the time, and the group’s accuracy rose to 67 percent if their answers were pooled. COMPAS, by contrast, has an accuracy of 65 percent. It’s barely better than individual guessers, and no better than a crowd. “These are nonexperts, responding to an online survey with a fraction of the amount of information that the software has,” says Farid. “So what exactly is software like COMPAS doing?”
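The gap between individual and pooled accuracy can be illustrated with a small majority-vote sketch. The votes and outcomes below are invented toy data, not the study's, and the pooling rule (majority answer per defendant) is an assumption, since the text does not say how answers were pooled.

```python
from statistics import mean

# Hypothetical toy data: each row is one volunteer's guesses
# (1 = "will reoffend") for the same five defendants.
guesses = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
# Hypothetical observed outcomes for those five defendants.
outcomes = [1, 0, 1, 0, 0]


def accuracy(predictions, truth):
    """Fraction of predictions that match the observed outcome."""
    return mean(int(p == t) for p, t in zip(predictions, truth))


def majority_vote(rows):
    """Pool guesses per defendant: predict 1 if more than half said 1."""
    n = len(rows)
    return [int(sum(column) * 2 > n) for column in zip(*rows)]


individual = [accuracy(row, outcomes) for row in guesses]
pooled = accuracy(majority_vote(guesses), outcomes)
print("mean individual accuracy:", mean(individual))
print("pooled accuracy:", pooled)
```

In this toy case the pooled vote is at least as accurate as the average volunteer, because individual mistakes on a given defendant tend to be outvoted.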
Don’t blame the algorithm — as long as there are racial disparities in the justice system, sentencing software can never be entirely fair.
For generations, the Maasai people of eastern Africa have passed down the story of a tireless old man. He lived alone and his life was not easy. He spent every day in the fields — tilling the land, tending the animals, and gathering water. The work was as necessary as it was exhausting. But the old man considered himself fortunate. He had a good life, and never really gave much thought to what was missing. One morning the old man was greeted with a pleasant surprise. Standing in his kitchen was a young boy, perhaps seven or eight years old. The old man had never seen him before. The boy smiled but said nothing. The old man looked around. His morning breakfast had already been prepared, just as he liked it. He asked the boy’s name. “Kileken,” the boy replied. After some prodding, the boy explained that, before preparing breakfast, he had completed all of the old man’s work for the day. Incredulous, the old man stepped outside. Indeed, the fields had been tilled, the animals tended, and the water gathered. Astonishment written all over his face, the old man staggered back into the kitchen. “How did this happen? And how can I repay you?” The boy smiled again, this time dismissively. “I will accept no payment. All I ask is that you let me stay with you.” The old man knew better than to look a gift horse in the mouth. Kileken and the old man soon became inseparable, and the farm grew lush and bountiful as it never had before. The old man could hardly remember what life was like before the arrival of his young comrade. There could be no doubt: With Kileken’s mysterious assistance, the old man was prospering. But he never quite understood why, or how, it had happened.
To an extent we have failed to fully acknowledge, decision-making algorithms have become our society’s collective Kileken. They show up unannounced and where we least expect them, promise and often deliver great things, and quickly come to be seen as indispensable. Their reach can’t be overestimated. They tell traders what stocks to buy and sell, determine how much our car insurance costs, influence which products Amazon shows us and how much we get charged for them, and interpret our Google searches and rank their results.
These algorithms improve the efficiency and accuracy of services we all rely on, create new products we never before could have imagined, relieve people of tedious work, and are an engine of seemingly unbounded economic growth. They also permeate areas of social decision-making that have traditionally been left to direct human judgment, like romantic matchmaking and criminal sentencing. Yet they are largely hidden from view, remain opaque even when we are prompted to examine them, and are rarely subject to the same checks and balances as human decision-makers.
Worse yet, some of these algorithms seem to reflect back to us society’s ugliest prejudices. Last April, for instance, our Facebook feeds — curated by a labyrinth of algorithms — were inundated with stories about FaceApp, a program that applied filters to uploaded photographs so that the user would appear younger or older or more attractive. At first, this app seemed to be just another clever pitch to the Snapchat generation. But things quickly went sideways when users discovered that the app’s “hot” filter — which purported to transform regular Joes and Jills into beautiful Adonises and Aphrodites — made skin lighter, eyes rounder, and noses smaller. The app appeared to be equating physical attractiveness with European facial features. The backlash was swift, ruthless, and seemingly well-deserved. The app — and, it followed, the algorithm it depended on — appeared to be racist. The company first renamed the “hot” filter to “exclude any positive connotation associated with it,” before unceremoniously pulling it from the app altogether.
FaceApp’s hot filter was far from the first algorithm to be accused of racism, and certainly won’t be the last. Google’s autocomplete feature — which relies on an algorithm that scans other users’ previous searches to try to guess your query — is regularly chastised for shining a spotlight on racist, sexist, and other regressive sentiments that would otherwise remain tucked away in the darkest corners of the Internet and our psyches.
But while rogue apps or discomfiting autocomplete suggestions are both ephemeral and potentially responsive to public outcry, the same can hardly be said about the insidious encroachment of decision-making algorithms into the workings of our legal system, where they frequently play a critical role in determining the fates of defendants — and, like FaceApp, often exhibit a preference for white subjects. But the problem is more than skin deep. The issue is that we cannot escape the long arm of America’s h
Although crime rates have fallen steadily since the 1990s, rates of recidivism remain a factor in the areas of both public safety and prisoner management. The National Institute of Justice defines recidivism as “criminal acts that resulted in rearrest, re-conviction or return to prison with or without a new sentence,” and with over 75 percent of released prisoners rearrested within five years, it’s apparent there’s room for improvement. In an effort to streamline sentencing, reduce recidivism and increase public safety, private companies have developed criminal justice algorithms for use in the courtroom. These tools — sometimes called “risk assessments” — aim to recommend sentence length and severity specific to each defendant based on a set of proprietary formulae. Unfortunately, the algorithms’ proprietary nature means that neither attorneys nor the general public have access to information necessary to understand or defend against these assessments.
There are dozens of these algorithms currently in use at federal and state levels across the nation. One of the most controversial, the Correctional Offender Management Profiling for Alternative Sanctions or COMPAS, made headlines in 2016 when defendant Eric Loomis received a six-year sentence for reckless endangerment, eluding police, driving a car without the owner’s consent, possession of a firearm, probation violation and resisting arrest, a sentence partially based on his COMPAS score. Loomis, a registered sex offender, challenged the sentence, claiming COMPAS violated his constitutional right to due process because he could not mount a proper challenge. His argument was two-fold: that the proprietary nature of the formula denied him and his defense team access to his data, and that COMPAS takes into account race and gender when predicting outcomes, which constitutes bias. His challenge was denied by the lower court, but Loomis refused to back down, instead appealing to the Wisconsin Supreme Court.
In July of 2016, a unanimous decision by the Wisconsin Supreme Court upheld the state’s decision to use automated programs to determine sentencing. In her opinion, Justice Ann Walsh Bradley wrote: “Although it cannot be determinative, a sentencing court may use a COMPAS risk assessment as a relevant factor for such matters as: (1) diverting low-risk prison-bound offenders to a non-prison alternative; (2) assessing whether an offender can be supervised safely and effectively in the community; and (3) imposing terms and conditions of probation, supervision, and responses to violations.” In response to Loomis’ contention that race and particularly gender can skew results and interfere with due process, Bradley further explained that “considering gender in a COMPAS risk assessment is necessary to achieve statistical accuracy.” Her opinion further cautioned that judges should be made aware of potential limitations of risk assessment tools and suggested guidelines for use such as quality control and validation checks on the software as well as user education.
A report from the Electronic Privacy Information Center (EPIC), however, warns that in many cases issues of validity and training are overlooked rather than addressed. To underscore their argument, EPIC, a public interest research center that focuses public attention on emerging privacy and civil liberties issues, compiled a chart matching states with the risk assessment tools used in their sentencing practices. They found more than 30 states that have never run a validation process on the algorithms in use within their state, suggesting that most of the time these programs are used without proper calibration.
The Problem with COMPAS
In states using COMPAS, defendants are asked to fill out a COMPAS questionnaire when they are booked into the criminal justice system. Their answers are analyzed by the proprietary COMPAS software, which generates predictive scores such as “risk of recidivism” and “risk of violent recidivism.”
These scores, calculated by the algorithm on a one-to-10 scale, are shown in a bar chart with 10 representing those most likely to reoffend. Judges receive these charts before sentencing to assist with determinations. COMPAS is not the only element a judge is supposed to consider when determining length and severity of sentence. Past criminal history, the circumstances of the crime (whether there was bodily harm committed or whether the offender was under personal stress) and whether or not the offender exhibits remorse are some examples of mitigating factors affecting sentencing. However, there is no way of telling how much weight a judge assigns to the information received from risk assessment software.
Taken on its own, the COMPAS chart seems like a reasonable, even helpful, bit of information; but the reality is much different. ProPublica conducted an analysis of the COMPAS algorithm and uncovered some valid concerns about the reliability and bias of the software.
In an analysis of over 10,000
Invisible algorithms increasingly shape the world we live in, and not always for the better. Unfortunately, few mechanisms are in place to ensure they’re not causing more harm than good.
That might finally be changing: A first-in-the-nation bill, passed yesterday in New York City, offers a way to help ensure the computer codes that governments use to make decisions are serving justice rather than inequality.
Computer algorithms are a series of steps or instructions designed to perform a specific task or solve a particular problem. Algorithms inform decisions that affect many aspects of society. These days, they can determine which school a child can attend, whether a person will be offered credit from a bank, what products are advertised to consumers, and whether someone will receive an interview for a job. Government officials also use them to predict where crimes will take place, who is likely to commit a crime and whether someone should be allowed out of jail on bail.
Algorithms are often presumed to be objective, infallible, and unbiased. In fact, they are highly vulnerable to human bias. And when algorithms are flawed, they can have serious consequences.
Just recently, a highly controversial DNA testing technique used by New York City’s medical examiner put thousands of criminal cases in jeopardy. Flawed code can also further entrench systemic inequalities. The algorithms used in facial recognition technology, for example, have been shown to be less accurate on Black people, women, and juveniles, putting innocent people at risk of being labeled crime suspects. And a ProPublica study has found that tools designed to determine the likelihood of future criminal activity made incorrect predictions that were biased against Black people. These tools are used to make bail and sentencing decisions, replicating the racism in the criminal justice system under a guise of technological neutrality.
But even when we know an algorithm is racist, it’s not so easy to understand why. That’s in part because algorithms are usually kept secret. In some cases, they are deemed proprietary by the companies that created them, who often fight tooth and nail to prevent the public from accessing the source code behind them. That secrecy makes it impossible to fix broken algorithms.
The New York City Council yesterday passed legislation that we are hopeful will move us toward addressing these problems. New York City already uses algorithms to help with a broad range of tasks: deciding who stays in and who gets out of jail, teacher evaluations, firefighting, identifying serious pregnancy complications, and much more. The NYPD also previously used an algorithm-fueled software program developed by Palantir Technologies that takes arrest records, license-plate scans, and other data, and then graphs that data to supposedly help reveal connections between people and even crimes. The department has since developed its own software to perform a similar task.
The bill, which is expected to be signed by Mayor Bill de Blasio, will provide a greater understanding of how the city’s agencies use algorithms to deliver services while increasing transparency around them. This bill is the first in the nation to acknowledge the need for transparency when governments use algorithms and to consider how to assess whether their use results in biased outcomes and how negative impacts can be remedied.
The legislation will create a task force to review New York City agencies’ use of algorithms and the policy issues they implicate. The task force will be made up of experts on transparency and fairness, along with staff from non-profits that work with people most likely to be harmed by flawed algorithms. It will develop a set of recommendations addressing when and how algorithms should be made public, how to assess whether they are biased, and the impact of such bias.
These are extremely thorny questions, and as a result, there are some things left unanswered in the bill. It doesn’t spell out, for example, whether the task force will require all source code underlying algorithms to be made public or if disclosing source code will depend on the algorithm and its context. While we believe strongly that allowing outside researchers to examine and test algorithms is key to strengthening these systems, the task force is charged with the responsibility of recommending the right approach.
Similarly, the bill leaves it to the task force to determine when an algorithm disproportionately harms a particular group of New Yorkers — based upon race, religion, gender, or a number of other factors. Because experts continue to debate this difficult issue, rigorous and thoughtful work by the task force will be crucial to protecting New Yorkers’ rights.
The New York Civil Liberties Union testified in support of an earlier version of the bill in October, but we will be watching to see the exact makeup of the task force, what recommendations are advanced, and whether de Blasio acts on them. We wil
Open up the photo app on your phone and search “dog,” and all the pictures you have of dogs will come up. This was no easy feat. Your phone knows what a dog “looks” like.
This modern-day marvel is the result of machine learning, a form of artificial intelligence. Programs like this comb through millions of pieces of data and make correlations and predictions about the world. Their appeal is immense: Machines can use cold, hard data to make decisions that are sometimes more accurate than a human’s.
But machine learning has a dark side. If not used properly, it can make decisions that perpetuate the racial biases that exist in society. It’s not because the computers are racist. It’s because they learn by looking at the world as it is, not as it ought to be.
Recently, newly elected Rep. Alexandria Ocasio-Cortez (D-NY) made this point in a discussion at a Martin Luther King Jr. Day event in New York City.
“Algorithms are still made by human beings, and those algorithms are still pegged to basic human assumptions,” she told writer Ta-Nehisi Coates at the annual MLK Now event. “They’re just automated assumptions. And if you don’t fix the bias, then you are just automating the bias.”
The next day, the conservative website the Daily Wire derided the comments.
But Ocasio-Cortez is right, and it’s worth reflecting on why.
If we’re not careful, AI will perpetuate bias in our world. Computers learn how to be racist, sexist, and prejudiced in much the same way a child does, as computer scientist Aylin Caliskan, now at George Washington University, told me in a 2017 interview. The computers learn from their creators: us.
“Many people think machines are not biased,” Caliskan, who was at Princeton at the time, said. “But machines are trained on human data. And humans are biased.”
We think artificial intelligence is impartial. Often, it’s not.
Nearly all new consumer technologies use machine learning in some way. Take Google Translate: No person instructed the software to learn how to translate Greek to French and then to English. It combed through countless reams of text and learned on its own. In other cases, machine learning programs make predictions about which résumés are likely to yield successful job candidates, or how a patient will respond to a particular drug.
Machine learning is a program that sifts through billions of data points to solve problems (such as “can you identify the animal in the photo”), but it doesn’t always make clear how it has solved the problem. And it’s increasingly clear these programs can develop biases and stereotypes without us noticing.
In 2016, ProPublica published an investigation on a machine learning program that courts use to predict who is likely to commit another crime after being booked. The reporters found that the software rated black people at a higher risk than whites.
“Scores like this — known as risk assessments — are increasingly common in courtrooms across the nation,” ProPublica explained. “They are used to inform decisions about who can be set free at every stage of the criminal justice system, from assigning bond amounts … to even more fundamental decisions about defendants’ freedom.”
The program learned about who is most likely to end up in jail from real-world incarceration data. And historically, the real-world criminal justice system has been unfair to black Americans.
This story reveals a deep irony about machine learning. The appeal of these systems is they can make impartial decisions, free of human bias. “If computers could accurately predict which defendants were likely to commit new crimes, the criminal justice system could be fairer and more selective about who is incarcerated and for how long,” ProPublica wrote.
But what happened was that machine learning programs perpetuated our biases on a large scale. So instead of a judge being prejudiced against African Americans, it was a robot.
Other cases are more ambiguous. In China, researchers paired facial recognition technology with machine learning to look at driver’s license photos and predict who is a criminal. It purported to have an accuracy of 89.5 percent.
Many experts were extremely skeptical of the findings. Which facial features were this program picking up on for the analysis? Was it the physical features of certain ethnic groups that are discriminated against in the justice system? Is it picking up on the signs of a low-socioeconomic upbringing that may leave lasting impressions on our faces?
It can be hard to know. (Scarier: There’s one startup called Faception that claims it can detect terrorists or pedophiles just by looking at faces.)
“You got the algorithms which are super powerful, but just as important is what kind of data you feed the algorithms to teach them to discriminate,” Princeton psychologist and facial perception expert Alexander Todorov told me in a 2017 interview, while discussing a controversial paper on using machine learning to predict sexual orientation from faces. “If
As a child, you develop a sense of what “fairness” means. It’s a concept that you learn early on as you come to terms with the world around you. Something either feels fair or it doesn’t.
But increasingly, algorithms have begun to arbitrate fairness for us. They decide who sees housing ads, who gets hired or fired, and even who gets sent to jail. Consequently, the people who create them—software engineers—are being asked to articulate what it means to be fair in their code. This is why regulators around the world are now grappling with a question: How can you mathematically quantify fairness?
This story attempts to offer an answer. And to do so, we need your help. We’re going to walk through a real algorithm, one used to decide who gets sent to jail, and ask you to tweak its various parameters to make its outcomes more fair. (Don’t worry—this won’t involve looking at code!)
The algorithm we’re examining is known as COMPAS, and it’s one of several different “risk assessment” tools used in the US criminal legal system.
At a high level, COMPAS is supposed to help judges determine whether a defendant should be kept in jail or be allowed out while awaiting trial. It trains on historical defendant data to find correlations between factors like someone’s age and history with the criminal legal system, and whether or not the person was rearrested. It then uses the correlations to predict the likelihood that a defendant will be arrested for a new crime during the trial-waiting period.
- Arrests vs. convictions
This process is highly imperfect. The tools use arrests as a proxy for crimes, but there are actually big discrepancies between the two because police have a history of disproportionately arresting racial minorities and of manipulating data. Rearrests, moreover, are often made for technical violations, such as failing to appear in court, rather than for repeat criminal activity. In this story, we oversimplify to examine what would happen if arrests corresponded to actual crimes.
This prediction is known as the defendant’s “risk score,” and it’s meant as a recommendation: “high risk” defendants should be jailed to prevent them from causing potential harm to society; “low risk” defendants should be released before their trial. (In reality, judges don’t always follow these recommendations, but the risk assessments remain influential.)
Proponents of risk assessment tools argue that they make the criminal legal system more fair. They replace judges’ intuition and bias—in particular, racial bias—with a seemingly more “objective” evaluation. They also can replace the practice of posting bail in the US, which requires defendants to pay a sum of money for their release. Bail discriminates against poor Americans and disproportionately affects black defendants, who are overrepresented in the criminal legal system.
- ProPublica’s methodology
For defendants who were jailed before trial, ProPublica looked at whether they were rearrested within two years after their release. It then used that to approximate whether the defendants would have been rearrested pre-trial had they not been jailed.
As required by law, COMPAS doesn’t include race in calculating its risk scores. In 2016, however, a ProPublica investigation argued that the tool was still biased against blacks. ProPublica found that among defendants who were never rearrested, black defendants were twice as likely as white ones to have been labeled high-risk by COMPAS.
So our task now is to try to make COMPAS better. Ready?
Let’s start with the same data set that ProPublica used in its analysis. It includes every defendant scored by the COMPAS algorithm in Broward County, Florida, from 2013 to 2014. In total, that’s over 7,200 profiles with each person’s name, age, race, and COMPAS risk score, noting whether the person was ultimately rearrested either after being released or jailed pre-trial.
To make the data easier to visualize, we’ve randomly sampled 500 black and white defendants from the full set.
We’ve represented each defendant as a dot.
Remember: all these dots are people accused (but not convicted) of a crime. Some will be jailed pre-trial; others will be released immediately. Some will go on to get rearrested after their release; others will not. We want to compare two things: the predictions (which defendants received “high” vs. “low” risk scores) and the real-world outcomes (which defendants actually got rearrested after being released).
COMPAS scores defendants on a scale of 1 to 10, where 1 roughly corresponds to a 10% chance of rearrest, 2 to 20%, and so on.
Let’s look at how COMPAS scored everyone.
- COMPAS’s scores
COMPAS was designed to make aggregate predictions about groups of people who share similar characteristics, rather than predictions about specific individuals. The methodology behind its scores and the recommendations for how to use them are more complicated than we had room to present; you can read about them at the link above.
Though COMPAS can only offer a statistical probability that a defendant will be rearrested pre-trial, judges, of course, have to make an all-or-nothing decision: whether to release or detain the defendant. For the purposes of this story, we are going to use COMPAS’s “high risk” threshold, a score of 7 or higher, to represent a recommendation that a defendant be detained.
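The step from a 1-to-10 score to an all-or-nothing recommendation can be sketched in a few lines. The function names are hypothetical, and the score-to-probability mapping is only the rough reading given in this story (1 is about 10%, 2 about 20%, and so on), not Northpointe's actual method.

```python
HIGH_RISK_CUTOFF = 7  # per the story: a score of 7 or higher counts as "high risk"


def check_score(score: int) -> None:
    """Reject anything outside the 1-10 scale described in the story."""
    if not 1 <= score <= 10:
        raise ValueError("scores run from 1 to 10")


def approx_rearrest_chance(score: int) -> float:
    """Rough reading of the scale: 1 -> 0.10, 2 -> 0.20, ..., 10 -> 1.00."""
    check_score(score)
    return score / 10


def recommend_detention(score: int, cutoff: int = HIGH_RISK_CUTOFF) -> bool:
    """Collapse a score into detain (True) or release (False)."""
    check_score(score)
    return score >= cutoff


for s in (6, 7):
    print(s, approx_rearrest_chance(s), recommend_detention(s))
```

Note how two defendants one point apart, with nearly the same estimated chance of rearrest, land on opposite sides of the decision.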
From here on out, you are in charge. Your mission is to redesign the last stage of this algorithm by finding a fairer place to set the “high risk” threshold.
This is what your threshold will look like. Try clicking on it and dragging it around.
So first, let’s imagine the best-case scenario: all the defendants your algorithm labels with a high risk score go on to get rearrested, and all defendants who get a low risk score do not. Below, our graphic depicts what this might look like. The filled-in circles are defendants who were rearrested; the empty circles are those who weren’t.
Now move the threshold to make your algorithm as fair as possible.
(In other words, only rearrested defendants should be jailed.)
Great! That was easy. Your threshold should be set between 6 and 7. No one was needlessly detained, and no one who was released was then rearrested.
But of course, this ideal scenario never actually happens. It’s impossible to perfectly predict the outcome for each person. This means the filled and empty dots can’t be so neatly separated.
So here’s who actually gets rearrested.
Now move the threshold again to make your algorithm as fair as possible.
(Hint: you want to maximize its accuracy.)
You’ll notice that no matter where you place the threshold, it’s never perfect: we always jail some defendants who don’t get rearrested (empty dots to the right of the threshold) and release some defendants who do get rearrested (filled dots to the left of the threshold). This is a trade-off that our criminal legal system has always dealt with, and it’s no different when we use an algorithm.
To make these trade-offs more clear, let’s see the percentage of incorrect predictions COMPAS makes on each side of the threshold, instead of just measuring the overall accuracy. Now we will be able to explicitly see whether our threshold favors needlessly keeping people in jail or releasing people who are then rearrested. Notice that COMPAS’s default threshold favors the latter.
- Technical definitions
These two error percentages are also known as the “false negative rate” (which we’ve labeled “released but rearrested”) and “false positive rate” (which we’ve labeled “needlessly jailed”).
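As a rough illustration of these two definitions, the sketch below computes both rates for a given cutoff. The ten scores and outcomes are made up for the example, and `error_rates` is a hypothetical helper, not part of any COMPAS tooling.

```python
def error_rates(scores, rearrested, cutoff):
    """Return (false_positive_rate, false_negative_rate).

    False positive rate: share of people who were NOT rearrested but scored
    at or above the cutoff ("needlessly jailed").
    False negative rate: share of people who WERE rearrested but scored
    below the cutoff ("released but rearrested").
    """
    pairs = list(zip(scores, rearrested))
    negatives = sum(1 for _, r in pairs if not r)
    positives = sum(1 for _, r in pairs if r)
    false_pos = sum(1 for s, r in pairs if s >= cutoff and not r)
    false_neg = sum(1 for s, r in pairs if s < cutoff and r)
    return false_pos / negatives, false_neg / positives


# Hypothetical toy data: ten defendants, 1 = rearrested.
scores = [2, 3, 5, 7, 8, 9, 4, 6, 8, 1]
rearrested = [0, 0, 1, 1, 1, 1, 0, 1, 0, 0]

fpr, fnr = error_rates(scores, rearrested, cutoff=7)
print(f"needlessly jailed: {fpr:.0%}, released but rearrested: {fnr:.0%}")
```

Each rate is measured against its own denominator (people who were not rearrested, and people who were), which is why the two can move in opposite directions as the cutoff shifts.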
How should we fairly balance this trade-off? There’s no universal answer, but in the 1760s, the English judge William Blackstone wrote, “It is better that ten guilty persons escape than that one innocent suffer.”
Blackstone’s ratio is still highly influential in the US today. So let’s use it for inspiration.
Move the threshold to where the “released but rearrested” percentage is roughly 10 times the “needlessly jailed” percentage.
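One way to make this search concrete is to try every cutoff and keep the one whose two error rates come closest to the 10-to-1 ratio. This is only a sketch: the per-score counts are invented (they are not the Broward County data), and measuring closeness as the absolute gap between the two sides is an assumption.

```python
# Hypothetical counts per score 1..10 (index 0 is score 1).
NOT_REARRESTED = [30, 25, 20, 15, 10, 8, 5, 4, 2, 1]
REARRESTED = [5, 5, 10, 10, 10, 10, 15, 15, 10, 10]


def error_rates(cutoff):
    """Rates when everyone scoring at or above `cutoff` is detained.

    Returns (needlessly_jailed, released_but_rearrested).
    """
    fpr = sum(NOT_REARRESTED[cutoff - 1:]) / sum(NOT_REARRESTED)
    fnr = sum(REARRESTED[:cutoff - 1]) / sum(REARRESTED)
    return fpr, fnr


def blackstone_gap(cutoff, ratio=10):
    """Distance from 'released but rearrested = ratio x needlessly jailed'."""
    fpr, fnr = error_rates(cutoff)
    return abs(fnr - ratio * fpr)


best_cutoff = min(range(1, 11), key=blackstone_gap)
best_fpr, best_fnr = error_rates(best_cutoff)
print(best_cutoff, round(best_fpr, 3), round(best_fnr, 3))
```

With these toy counts the closest match detains only scores of 8 and above, which jails few people needlessly but releases most of those who are later rearrested.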
You can already see two problems with using an algorithm like COMPAS. The first is that better prediction can always help reduce error rates across the board, but it can never eliminate them entirely. No matter how much data we collect, two people who look the same to the algorithm can always end up making different choices.
The second problem is that even if you follow COMPAS’s recommendations consistently, someone—a human—has to first decide where the “high risk” threshold should lie, whether by using Blackstone’s ratio or something else. That depends on all kinds of considerations—political, economic, and social.
Now we’ll come to a third problem. This is where our explorations of fairness start to get interesting. How do the error rates compare across different groups? Are there certain types of people who are more likely to get needlessly detained?
Let’s see what our data looks like when we consider the defendants’ race.
Now move each threshold to see how it affects black and white defendants differently.
Race is an example of a protected class in the US, which means discrimination on that basis is illegal. Other protected classes include gender, age, and disability.
Now that we’ve separated black and white defendants, we’ve discovered that even though race isn’t used to calculate the COMPAS risk scores, the scores have different error rates for the two groups. At the default COMPAS threshold between 7 and 8, 16% of black defendants who don’t get rearrested have been needlessly jailed, while the same is true for only 7% of white defendants. That doesn’t seem fair at all! This is exactly what ProPublica highlighted in its investigation.
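The comparison described here amounts to computing the "needlessly jailed" rate separately for each group at one shared cutoff. A minimal sketch, using invented records and placeholder group labels "A" and "B" rather than the real data:

```python
from collections import defaultdict


def needlessly_jailed_by_group(records, cutoff):
    """False positive rate per group at a single shared cutoff.

    `records` is an iterable of (group, score, rearrested) tuples.
    """
    not_rearrested = defaultdict(int)
    flagged = defaultdict(int)
    for group, score, rearrested in records:
        if not rearrested:
            not_rearrested[group] += 1
            if score >= cutoff:
                flagged[group] += 1
    return {g: flagged[g] / n for g, n in not_rearrested.items()}


# Hypothetical toy records: (group, score, rearrested).
records = [
    ("A", 9, False), ("A", 8, False), ("A", 3, False), ("A", 5, False),
    ("A", 9, True), ("A", 4, True),
    ("B", 8, False), ("B", 2, False), ("B", 3, False), ("B", 4, False),
    ("B", 9, True), ("B", 2, True),
]

rates = needlessly_jailed_by_group(records, cutoff=8)
print(rates)
```

The group label never enters the decision rule itself, yet the rates can still differ, because the scores are distributed differently within each group.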
Okay, so let’s fix this.
Move each threshold so white and black defendants are needlessly jailed at roughly the same rate.
(There are a number of solutions. We’ve picked one, but you can try to find others.)
We tried to reach Blackstone’s ratio again, so we arrived at the following solution: white defendants have a threshold between 6 and 7, while black defendants have a threshold between 8 and 9. Now roughly 9% of both black and white defendants who don’t get rearrested are needlessly jailed, while 75% of those who do are rearrested after spending no time in jail. Good work! Your algorithm seems much fairer than COMPAS now.
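Choosing a separate threshold per group can be sketched as picking, for each group, the cutoff whose "needlessly jailed" rate lands closest to a shared target. The counts, the group labels and the 10% target are all hypothetical, chosen only to show the mechanics.

```python
# Hypothetical counts of defendants who were NOT rearrested, per score 1..10
# (index 0 is score 1), for two placeholder groups.
NOT_REARRESTED = {
    "A": [10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
    "B": [20, 20, 15, 15, 10, 8, 5, 4, 2, 1],
}


def false_positive_rate(counts, cutoff):
    """Share of non-rearrested defendants scoring at or above the cutoff."""
    return sum(counts[cutoff - 1:]) / sum(counts)


def cutoff_for_target(counts, target_fpr):
    """Cutoff (1..10) whose false positive rate is closest to the target."""
    return min(
        range(1, 11),
        key=lambda c: abs(false_positive_rate(counts, c) - target_fpr),
    )


cutoffs = {g: cutoff_for_target(c, 0.10) for g, c in NOT_REARRESTED.items()}
for group, cutoff in cutoffs.items():
    print(group, cutoff, false_positive_rate(NOT_REARRESTED[group], cutoff))
```

In this toy setup the two groups end up with cutoffs of 10 and 7, so a defendant scoring 8 would be detained in one group and released in the other.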
But wait—is it? In the process of matching the error rates between races, we lost something important: our thresholds for each group are in different places, so our risk scores mean different things for white and black defendants.
White defendants get jailed for a risk score of 7, but black defendants get released for the same score. This, once again, doesn’t seem fair. Two people with the same risk score have the same probability of being rearrested, so shouldn’t they receive the same treatment? In the US, using different thresholds for different races may also raise complicated legal issues under the equal protection clause of the Constitution’s 14th Amendment.
So let’s try this one more time with a single threshold shared between both groups.
Move the threshold again so white and black defendants are needlessly jailed at the same rate.
If you’re getting frustrated, there’s good reason. There is no solution.
We gave you two definitions of fairness: keep the error rates comparable between groups, and treat people with the same risk scores in the same way. Both of these definitions are totally defensible! But satisfying both at the same time is impossible.
The reason is that black and white defendants are rearrested at different rates. Whereas 52% of black defendants were rearrested in our Broward County data, only 39% of white defendants were. There’s a similar difference in many jurisdictions across the US, in part because of the country’s history of police disproportionately targeting minorities (as we previously mentioned).
Predictions reflect the data used to make them—whether by algorithm or not. If black defendants are arrested at a higher rate than white defendants in the real world, they will have a higher rate of predicted arrest as well. This means they will also have higher risk scores on average, and a larger percentage of them will be labeled high-risk—both correctly and incorrectly. This is true no matter what algorithm is used, as long as it’s designed so that each risk score means the same thing regardless of race.
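One way to see why there is no solution is an accounting identity that holds for any yes-or-no prediction: the false positive rate equals p / (1 - p) × (1 - PPV) / PPV × (1 - FNR), where p is the group's rearrest rate, PPV is the share of people labeled high-risk who are in fact rearrested, and FNR is the share of rearrested people who were labeled low-risk. This is a closely related formulation of the conflict, not necessarily the exact one behind the interactive above. The sketch below plugs in the two rearrest rates from the text; the PPV and FNR values are illustrative assumptions, not figures from the Broward County data.

```python
def implied_false_positive_rate(base_rate: float, ppv: float, fnr: float) -> float:
    """False positive rate forced by a group's base rate, PPV, and FNR.

    Follows from FP = TP * (1 - PPV) / PPV and TP = (1 - FNR) * positives.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Rearrest rates quoted in the text.
BASE_RATE_BLACK = 0.52
BASE_RATE_WHITE = 0.39

# Hypothetical values, held equal across groups so that a "high-risk" label
# is equally reliable and equally likely to miss a rearrest in both groups.
PPV = 0.6
FNR = 0.4

fpr_black = implied_false_positive_rate(BASE_RATE_BLACK, PPV, FNR)
fpr_white = implied_false_positive_rate(BASE_RATE_WHITE, PPV, FNR)
print(f"black: {fpr_black:.1%}, white: {fpr_white:.1%}")
```

With PPV and FNR held equal, the group with the higher rearrest rate always ends up with the higher false positive rate. Equalizing the false positive rates instead would require letting PPV or FNR differ between the groups.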
This strange conflict of fairness definitions isn’t limited to risk assessment algorithms in the criminal legal system. The same sorts of paradoxes hold true for credit scoring, insurance, and hiring algorithms. In any context where an automated decision-making system must allocate resources or punishments among multiple groups that have different outcomes, different definitions of fairness will inevitably turn out to be mutually exclusive.
There is no algorithm that can fix this; this isn’t even an algorithmic problem, really. Human judges are currently making the same sorts of forced trade-offs—and have done so throughout history.
But here’s what an algorithm has changed. Though judges may not always be transparent about how they choose between different notions of fairness, people can contest their decisions. In contrast, COMPAS, which is made by the private company Northpointe, is a trade secret that cannot be publicly reviewed or interrogated. Defendants can no longer question its outcomes, and government agencies lose the ability to scrutinize the decision-making process. There is no more public accountability.
So what should regulators do? The proposed Algorithmic Accountability Act of 2019 is an example of a good start, says Andrew Selbst, a law professor at the University of California who specializes in AI and the law. The bill, which seeks to regulate bias in automated decision-making systems, has two notable features that serve as a template for future legislation. First, it would require companies to audit their machine-learning systems for bias and discrimination in an “impact assessment.” Second, it doesn’t specify a definition of fairness.
“With an impact assessment, you're being very transparent about how you as a company are approaching the fairness question,” Selbst says. That brings public accountability back into the debate. Because “fairness means different things in different contexts,” he adds, avoiding a specific definition allows for that flexibility.
But whether algorithms should be used to arbitrate fairness in the first place is a complicated question. Machine-learning algorithms are trained on “data produced through histories of exclusion and discrimination,” writes Ruha Benjamin, an associate professor at Princeton University, in her book Race After Technology. Risk assessment tools are no different. The greater question about using them—or any algorithms used to rank people—is whether they reduce existing inequities or make them worse.
Selbst recommends proceeding with caution: “Whenever you turn philosophical notions of fairness into mathematical expressions, they lose their nuance, their flexibility, their malleability,” he says. “That’s not to say that some of the efficiencies of doing so won’t eventually be worthwhile. I just have my doubts.”