Incident 40: COMPAS Algorithm Performs Poorly in Crime Recidivism Prediction

Description: Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), a recidivism risk-assessment algorithmic tool used in the judicial system to assess likelihood of defendants' recidivism, is found to be less accurate than random untrained human evaluators.
Alleged: Equivant developed and deployed an AI system, which harmed Accused People.

Suggested citation format

Anonymous. (2016-05-23) Incident Number 40. in McGregor, S. (ed.) Artificial Intelligence Incident Database. Responsible AI Collaborative.

Incident Stats

Incident ID
40
Report Count
22
Incident Date
2016-05-23
Editors
Sean McGregor

CSET Taxonomy Classifications

Taxonomy Details

Full Description

In 2018, researchers at Dartmouth College published a study comparing the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) tool, a recidivism risk-assessment algorithm, with 462 random untrained human subjects on their ability to predict criminals' risk of recidivism. Researchers gave the subjects descriptions of defendants highlighting seven pieces of information and asked them to predict whether each defendant would commit another crime within two years. The pooled judgment of these untrained subjects was accurate 67% of the time, compared with COMPAS's accuracy rate of 65%.

Short Description

Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), a recidivism risk-assessment algorithmic tool used in the judicial system to assess likelihood of defendants' recidivism, is found to be less accurate than random untrained human evaluators.

Severity

Minor

Harm Type

Harm to social or political systems

AI System Description

Predictive algorithm, informed in part by a defendant questionnaire, that produces scores correlating with a subject's recidivism risk

System Developer

Equivant

Sector of Deployment

Public administration and defence

Relevant AI functions

Perception, Cognition, Action

AI Techniques

law enforcement algorithm

AI Applications

risk assessment

Location

USA

Named Entities

Dartmouth College, Equivant

Technology Purveyor

Equivant

Beginning Date

2018-01-17T08:00:00.000Z

Ending Date

2018-01-17T08:00:00.000Z

Near Miss

Near miss

Intent

Accident

Lives Lost

No

Infrastructure Sectors

Government facilities

Data Inputs

Questionnaire consisting of 137 factors, such as age, prior convictions, and criminal record

Incident Reports


Across the nation, judges, probation and parole officers are increasingly using algorithms to assess a criminal defendant’s likelihood of becoming a recidivist – a term used to describe criminals who re-offend. There are dozens of these risk assessment algorithms in use. Many states have built their own assessments, and several academics have written tools. There are also two leading nationwide tools offered by commercial vendors.

We set out to assess one of the commercial tools made by Northpointe, Inc. to discover the underlying accuracy of their recidivism algorithm and to test whether the algorithm was biased against certain groups.

Our analysis of Northpointe’s tool, called COMPAS (which stands for Correctional Offender Management Profiling for Alternative Sanctions), found that black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than black defendants to be incorrectly flagged as low risk.

We looked at more than 10,000 criminal defendants in Broward County, Florida, and compared their predicted recidivism rates with the rate that actually occurred over a two-year period. When most defendants are booked in jail, they respond to a COMPAS questionnaire. Their answers are fed into the COMPAS software to generate several scores including predictions of “Risk of Recidivism” and “Risk of Violent Recidivism.”

We compared the recidivism risk categories predicted by the COMPAS tool to the actual recidivism rates of defendants in the two years after they were scored, and found that the score correctly predicted an offender’s recidivism 61 percent of the time, but was only correct in its predictions of violent recidivism 20 percent of the time.

In forecasting who would re-offend, the algorithm correctly predicted recidivism for black and white defendants at roughly the same rate (59 percent for white defendants, and 63 percent for black defendants) but made mistakes in very different ways: it misclassified white and black defendants differently when examined over a two-year follow-up period.

Our analysis found that:

Black defendants were often predicted to be at a higher risk of recidivism than they actually were. Our analysis found that black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts (45 percent vs. 23 percent).

White defendants were often predicted to be less risky than they were. Our analysis found that white defendants who re-offended within the next two years were mistakenly labeled low risk almost twice as often as black re-offenders (48 percent vs. 28 percent).

The analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 45 percent more likely to be assigned higher risk scores than white defendants.

Black defendants were also twice as likely as white defendants to be misclassified as being a higher risk of violent recidivism. And white violent recidivists were 63 percent more likely to have been misclassified as a low risk of violent recidivism, compared with black violent recidivists.

The violent recidivism analysis also showed that even when controlling for prior crimes, future recidivism, age, and gender, black defendants were 77 percent more likely to be assigned higher risk scores than white defendants.
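To make the "controlling for" step concrete: figures like these typically come from a logistic regression in which race enters alongside the control variables, and the race coefficient is read off as an odds ratio. Below is a minimal sketch of that kind of model, assuming the publicly released Broward County dataset and its column names; it is not ProPublica's exact specification.

```python
# Sketch of a logistic regression that "controls for" other factors, in the
# spirit of ProPublica's analysis; not their exact specification. The file
# and column names assume the publicly released Broward County dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("compas-scores-two-years.csv")

# Outcome: whether COMPAS rated the defendant medium or high risk.
df["high_risk"] = (df["decile_score"] >= 5).astype(int)

model = smf.logit(
    "high_risk ~ C(race) + priors_count + two_year_recid + age + C(sex)",
    data=df,
).fit()

# exp(coefficient) on a race term is an odds ratio; a value of roughly 1.45
# would correspond to "45 percent more likely" in the article's framing.
print(np.exp(model.params).round(2))
```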

Previous Work

In 2013, researchers Sarah Desmarais and Jay Singh examined 19 different recidivism risk methodologies being used in the United States and found that “in most cases, validity had only been examined in one or two studies conducted in the United States, and frequently, those investigations were completed by the same people who developed the instrument.”

Their analysis of the research published before March 2013 found that the tools “were moderate at best in terms of predictive validity,” Desmarais said in an interview. And she could not find any substantial set of studies conducted in the United States that examined whether risk scores were racially biased. “The data do not exist,” she said.

The largest examination of racial bias in U.S. risk assessment algorithms since then is a 2016 paper by Jennifer Skeem at University of California, Berkeley and Christopher T. Lowenkamp from the Administrative Office of the U.S. Courts. They examined data about 34,000 federal offenders to test the predictive validity of the Post Conviction Risk Assessment tool that was developed by the federal courts to help probation and parole officers determine the level of supervision required for an inmate upon release.

The authors found that the average risk score for black offenders was higher than for white offenders, but concluded that the differences were not attributable to bias.

A 2013 study analyzed the predictive validity among various races for another score called the Level of Service Inventory…

How We Analyzed the COMPAS Recidivism Algorithm

It was a striking story. “Machine Bias,” the headline read, and the teaser proclaimed: “There’s software used across the country to predict future criminals. And it’s biased against blacks.”

ProPublica, a Pulitzer Prize–winning nonprofit news organization, had analyzed risk assessment software known as COMPAS. It is being used to forecast which criminals are most likely to ­reoffend. Guided by such forecasts, judges in courtrooms throughout the United States make decisions about the future of defendants and convicts, determining everything from bail amounts to sentences. When ProPublica compared COMPAS’s risk assessments for more than 10,000 people arrested in one Florida county with how often those people actually went on to reoffend, it discovered that the algorithm “correctly predicted recidivism for black and white defendants at roughly the same rate.” But when the algorithm was wrong, it was wrong in different ways for blacks and whites. Specifically, “blacks are almost twice as likely as whites to be labeled a higher risk but not actually re-offend.” And COMPAS tended to make the opposite mistake with whites: “They are much more likely than blacks to be labeled lower risk but go on to commit other crimes.”

Whether it’s appropriate to use systems like COMPAS is a question that goes beyond racial bias. The U.S. Supreme Court might soon take up the case of a Wisconsin convict who says his right to due process was violated when the judge who sentenced him consulted COMPAS, because the workings of the system were opaque to the defendant. Potential problems with other automated decision-making (ADM) systems exist outside the justice system, too. On the basis of online personality tests, ADMs are helping to determine whether someone is the right person for a job. Credit-scoring algorithms play an enormous role in whether you get a mortgage, a credit card, or even the most cost-effective cell-phone deals.

It’s not necessarily a bad idea to use risk assessment systems like COMPAS. In many cases, ADM systems can increase fairness. Human decision making is at times so incoherent that it needs oversight to bring it in line with our standards of justice. As one particularly unsettling study showed, parole boards were more likely to free convicts if the judges had just had a meal break. This probably had never occurred to the judges. An ADM system could discover such inconsistencies and improve the process.

But often we don’t know enough about how ADM systems work to know whether they are fairer than humans would be on their own. In part because the systems make choices on the basis of underlying assumptions that are not clear even to the systems’ designers, it’s not necessarily possible to determine which algorithms are biased and which ones are not. And even when the answer seems clear, as in ­ProPublica’s findings on COMPAS, the truth is sometimes more complicated.

Lawmakers, the courts, and an informed public should decide what we want algorithms to prioritize.

What should we do to get a better handle on ADMs? Democratic societies need more oversight over such systems than they have now. AlgorithmWatch, a Berlin-based nonprofit advocacy organization that I cofounded with a computer scientist, a legal philosopher, and a fellow journalist, aims to help people understand the effects of such systems. “The fact that most ADM procedures are black boxes to the people affected by them is not a law of nature. It must end,” we assert in our manifesto. Still, our take on the issue is different from many critics’—because our fear is that the technology could be demonized undeservedly. What’s important is that societies, and not only algorithm makers, make the value judgments that go into ADMs.

Measures of fairness

COMPAS determines its risk scores from answers to a questionnaire that explores a defendant’s criminal history and attitudes about crime. Does this produce biased results?

After ProPublica’s investigation, Northpointe, the company that developed COMPAS, disputed the story, arguing that the journalists misinterpreted the data. So did three criminal-justice researchers, including one from a justice-reform organization. Who’s right—the reporters or the researchers? Krishna Gummadi, head of the Networked Systems Research Group at the Max Planck Institute for Software Systems in Saarbrücken, Germany, offers a surprising answer: they all are.

Gummadi, who has extensively researched fairness in algorithms, says ProPublica’s and Northpointe’s results don’t contradict each other. They differ because they use different measures of fairness.

If used properly, criminal-justice algorithms offer “the chance of a generation, and perhaps a lifetime, to reform sentencing and unwind mass incarceration in a scientific way.”

Imagine you are designing a system to predict which criminals will reoffend. One option is to optimize for “true positives,” meaning that you will identify as many people as possible who are at high risk of committing…

Inspecting Algorithms for Bias

The criminal justice system is becoming automated. At every stage — from policing and investigations to bail, evidence, sentencing and parole — computer systems play a role. Artificial intelligence deploys cops on the beat. Audio sensors generate gunshot alerts. Forensic analysts use probabilistic software programs to evaluate fingerprints, faces and DNA. Risk-assessment instruments help to determine who is incarcerated and for how long.

Technological advancement is, in theory, a welcome development. But in practice, aspects of automation are making the justice system less fair for criminal defendants.

The root of the problem is that automated criminal justice technologies are largely privately owned and sold for profit. The developers tend to view their technologies as trade secrets. As a result, they often refuse to disclose details about how their tools work, even to criminal defendants and their attorneys, even under a protective order, even in the controlled context of a criminal proceeding or parole hearing.

Take the case of Glenn Rodríguez. An inmate at the Eastern Correctional Facility in upstate New York, Mr. Rodríguez was denied parole last year despite having a nearly perfect record of rehabilitation. The reason? A high score from a computer system called Compas. The company that makes Compas considers the weighting of inputs to be proprietary information. That forced Mr. Rodríguez to rely on his own ingenuity to figure out what had gone wrong.

This year, Mr. Rodríguez returned to the parole board with the same faulty Compas score. He had identified an error in one of the inputs for his Compas assessment. But without knowing the input weights, he was unable to explain the effect of this error, or persuade anyone to correct it. Instead of challenging the result, he was left to try to argue for parole despite the result.

Mr. Rodríguez was lucky. In the end, he made parole and left Eastern Correctional in mid-May. But had he been able to examine and contest the logic of the Compas system to prove that its score gave a distorted picture of his life, he might have gone home much earlier.

Or consider the case of Billy Ray Johnson, a defendant in California who was sentenced to life without parole for a series of burglaries and sexual assaults that he says he did not commit. The prosecution relied on the results of a software program called TrueAllele that was used to analyze traces of DNA from the crime scenes.

When an expert witness for Mr. Johnson sought to review the TrueAllele source code in order to confront and cross-examine its programmer about how the software works, the developer claimed it was a trade secret, and the court refused to order the code disclosed — even though Mr. Johnson’s attorney offered to sign a protective order that would safeguard the code. Mr. Johnson was thus unable to fully challenge the evidence used to find him guilty.

TrueAllele’s developer maintains this decision was right. It has submitted affidavits to courts across the country alleging that disclosing the program’s source code to defense attorneys would cause “irreparable harm” to the company because it would allow competitors to steal the code. Most judges have credited this claim, quashing defense subpoenas for the source code and citing the company’s intellectual property interests as a rationale.

In 2015, a California Appeals Court upheld a trade secret evidentiary privilege in a criminal proceeding — for what is likely the first time in the nation’s history — to shield TrueAllele source code from disclosure to the defense. That decision, People v. Chubbs, is now being cited across the country to deny defendants access to trade secret evidence.

TrueAllele is not alone. In another case, an organization that produces cybercrime investigative software tried to invoke a trade secret evidentiary privilege to withhold its source code, despite concerns that the program violated the Fourth Amendment by surreptitiously scanning computer hard drives. In still other instances, developers of face recognition technology have refused to disclose the user manuals for their software programs, potentially obstructing defense experts’ ability to evaluate whether a program has been calibrated for certain racial groups and not for others.

Likewise, the algorithms used to generate probabilistic matches for latent fingerprint analysis, and to search ballistic information databases for firearm and cartridge matches, are treated as trade secrets and remain inaccessible to independent auditors.

This is a new and troubling feature of the criminal justice system. Property interests do not usually shield relevant evidence from the accused. And it’s not how trade secrets law is supposed to work, either. The most common explanation for why this form of intellectual property should exist is that people will be more likely to invest in new ideas if they can stop their business competitors from free riding on the results. The law is designed to stop business competitors from stealing confidential commercial information, not to justify withholding information from the defense in criminal proceedings.

Defense advocacy is a keystone of due process, not a business competition. And defense attorneys are officers of the court, not would-be thieves. In civil cases, trade secrets are often disclosed to opposing parties subject to a protective order. The same solution should work for those defending life or liberty.

The Supreme Court is currently considering hearing a case, Wisconsin v. Loomis, that raises similar issues. If it hears the case, the court will have the opportunity to rule on whether it violates due process to sentence someone based on a risk-assessment instrument whose workings are protected as a trade secret. If the court declines the case or rules that this is constitutional, legislatures should step in and pass laws limiting trade-secret safeguards in criminal proceedings to a protective order and nothing more.

The future of the criminal justice system may depend on it.

When a Computer Program Keeps You in Jail

Predicting the future is not only the province of fortune tellers or media pundits. Predictive algorithms, based on extensive datasets and statistics, have overtaken wholesale and retail operations, as any online shopper knows. And in the last few years algorithms have been used to automate decision making for bank loans, school admissions, hiring and, infamously, predicting recidivism – the probability that a defendant will commit another crime in the next two years.

COMPAS, which stands for Correctional Offender Management Profiling for Alternative Sanctions, is such a program and was singled out by ProPublica earlier this year as being racially biased. COMPAS utilizes 137 variables in its proprietary and unpublished scoring algorithm; race is not one of those variables. ProPublica used a dataset of defendants in Broward County, Florida. The data included demographics, criminal history, a COMPAS score [1] and the criminal actions in the subsequent two years. ProPublica then crosslinked this data with the defendants’ race. Their findings are generally accepted by all sides:

COMPAS is moderately accurate in identifying white and black recidivism about 60% of the time.

COMPAS’s errors reflect apparent racial bias. Blacks are more often wrongly identified as recidivist risks (statistically a false positive) and whites more often erroneously identified as not being a risk (a false negative).

The “penalty” for being misclassified as a higher risk is more likely to be stiffer punishment. Being misclassified as a lower risk is like a “Get out of jail” card.

As you might anticipate, significant controversy followed couched mostly in an academic fight over which statistical measures or calculations more accurately depicted the bias. A study in Science Advances revisits the discussion and comes to a different conclusion.

The study

In the current study, assessment by humans was compared with that of COMPAS using the same Broward County dataset. The humans were recruited on Amazon’s Mechanical Turk [2], paid a dollar to participate, with a $5 bonus if they were accurate more than 60% of the time.

Humans were just as accurate as the algorithm (62% vs. 65%).

The errors by the algorithm and humans were identical, overpredicting (false positives) recidivism for black defendants and underpredicting (false negatives) for white defendants.

                   Humans                COMPAS
                   Black %    White %    Black %    White %
Accuracy*          68.2       67.6       64.9       65.7
False Positives    37.1       27.2       40.4       25.4
False Negatives    29.2       40.3       30.9       47.9

* Accuracy is the combined rate of true positives and true negatives, statistically speaking
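For readers who want the arithmetic behind those three rows, here is a minimal sketch of how accuracy, false-positive rate, and false-negative rate are computed from predictions and outcomes; the arrays below are made up for illustration, not taken from the study.

```python
# Minimal sketch of the three metrics in the table above.
# `y_true` is whether the defendant actually reoffended within two years,
# `y_pred` is the (human or COMPAS) prediction; both arrays are illustrative.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy = (tp + tn) / len(y_true)          # "Accuracy" row
false_positive_rate = fp / (fp + tn)        # "False Positives" row
false_negative_rate = fn / (fn + tp)        # "False Negatives" row

print(accuracy, false_positive_rate, false_negative_rate)
```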

The human assessors used only seven variables, not the 137 of COMPAS. [3] This suggests that the algorithm was needlessly complex, at least in deciding recidivism risk. In fact, the researchers found that just two variables, defendant age and the number of prior convictions, were as accurate as COMPAS’s predictions.

Of more significant interest is the finding that giving human assessors additional information about defendant race had no impact: they were just as accurate and demonstrated the same racial disparity in false positives and negatives. Race was an associated confounder, but it was not the cause of the statistical difference. ProPublica’s narrative of racial bias was incorrect.

Algorithms are statistical models involving choices. If you optimize to find all the true positives, your false positives will increase. Lower your false positive rate and the false negatives increase. Do we want to incarcerate more or less? The MIT Technology Review puts it this way.

“Are we primarily interested in taking as few chances as possible that someone will skip bail or re-offend? What trade-offs should we make to ensure justice and lower the massive social costs of imprisonment?”
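The trade-off described above can be made concrete by sweeping the cut-off applied to any risk score: a higher threshold flags fewer people, trading false positives for false negatives. The sketch below uses synthetic scores and outcomes purely to illustrate the mechanics; it says nothing about COMPAS's actual behavior.

```python
# Illustration of the false-positive / false-negative trade-off:
# raising or lowering the cut-off on a risk score moves one error rate
# down and the other up. Scores and outcomes here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
reoffended = rng.random(n) < 0.45                    # ~45% base rate
# Noisy score: reoffenders tend to score higher, but far from perfectly.
score = 0.5 * reoffended + rng.normal(0, 0.35, n)

for threshold in (0.2, 0.4, 0.6, 0.8):
    flagged = score >= threshold
    fpr = np.mean(flagged & ~reoffended) / np.mean(~reoffended)
    fnr = np.mean(~flagged & reoffended) / np.mean(reoffended)
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```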

COMPAS is meant to serve as a decision aid. The purpose of the 137 variables is to create a variety of scales depicting substance abuse, environment, criminal opportunity, associates, etc. Its role is to assist the humans of our justice system in determining an appropriate punishment. [4] None of the studies, to my knowledge, looked at the sentences handed down. The current research ends as follows:

“When considering using software such as COMPAS in making decisions that will significantly affect the lives and well-being of criminal defendants, it is valuable to ask whether we would put these decisions in the hands of random people who respond to an online survey because, in the end, the results from these two approaches appear to be indistinguishable.”

The answer is no. COMPAS and similar algorithms are tools, not a replacement for human judgment. They facilitate but do not automate. ProPublica is correct when saying algorithmic decisions need to be understood by their human users and require continuous validation and refinement. But ProPublica’s narrative, that evil forces were responsible for a racially biased algorithm, is not true.

[1] COMPAS is scored on a 1 to 10 scale, with scores greater…

ProPublica Is Wrong In Charging Racial Bias In An Algorithm

Our most sophisticated crime-predicting algorithms may not be as good as we thought. A study published today in Science Advances takes a look at the popular COMPAS algorithm — used to assess the likelihood that a given defendant will reoffend — and finds the algorithm is no more accurate than the average person’s guess. If the findings hold, they would be a black eye for sentencing algorithms in general, indicating we may simply not have the tools to accurately predict whether a defendant will commit further crimes.

Developed by Equivant (formerly Northpointe), the COMPAS algorithm examines a defendant’s criminal record alongside a series of other factors to assess how likely they are to be arrested again in the next two years. COMPAS’ risk assessment can then inform a judge’s decisions about bail or even sentencing. If the algorithm is inaccurate, the result could be a longer sentence for an otherwise low-risk defendant, a significant harm for anyone impacted.

Reached by The Verge, Equivant contested the accuracy of the paper in a lengthy statement, calling the work “highly misleading.”

“The ceiling in predictive power is lower than I had thought”

COMPAS has been criticized by ProPublica for racial bias (a claim some statisticians dispute), but the new paper, from Hany Farid and Julia Dressel of Dartmouth, tackles a more fundamental question: are COMPAS’ predictions any good? Drawing on ProPublica’s data, Farid and Dressel found the algorithm predicted reoffenses roughly 65 percent of the time — a low bar, given that roughly 45 percent of defendants reoffend.

In its statement, however, Equivant argues it has cleared the 70 percent AUC standard for risk assessment tools.

The most surprising results came when researchers compared COMPAS to other kinds of prediction. Farid and Dressel recruited 462 random workers through Amazon’s Mechanical Turk platform, and asked the Turkers to “read a few sentences about an actual person and predict if they will commit a crime in the future.” They were paid one dollar for completing the task, with a five dollar bonus if their accuracy was over 65 percent. Surprisingly, the median Turker ended up two points better than COMPAS, clocking in at 67 percent accuracy.

George Mason University law professor Megan Stevenson has done similarly pessimistic research on risk assessment programs in Kentucky, and says she was surprised by just how bad the finding was for COMPAS. The sample size is small so it’s hard to be sure COMPAS’ disadvantage will hold up in further testing, but it’s damning enough that COMPAS is in the same general range as such an ad hoc system.

“The paper definitely has me thinking that the ceiling in predictive power is lower than I had thought,” Stevenson told The Verge, “and I didn’t think it was that high to begin with.”

The researchers also edged out COMPAS with a simpler linear algorithm, which looked only at a defendant’s age and criminal record — a finding that surprised even the researchers, given the 137 factors involved in a COMPAS assessment. “We typically would expect that as we add more data to a classifier and / or increase the complexity of the classifier, that the classification accuracy would improve,” Farid told The Verge. “We found this not to be the case.”
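For illustration, a two-feature linear classifier of the kind described can be fit in a few lines. The sketch below assumes ProPublica's public Broward County file and column names and is not the authors' exact model.

```python
# Sketch of a two-feature linear classifier (age + prior convictions),
# in the spirit of the Dressel & Farid comparison. Not their exact model;
# file and column names are assumptions based on the public Broward data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("compas-scores-two-years.csv")
X = df[["age", "priors_count"]]
y = df["two_year_recid"]

clf = LogisticRegression()
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.2f}")
```

The point is only that an off-the-shelf linear model on age and prior count lands in the same mid-60s accuracy range reported above; exact numbers depend on the data and the split.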

Equivant challenged this finding as well, arguing the small data sample had led the researchers to over-fit their algorithm. Furthermore, the company downplayed the number of different factors that actually determine a given risk assessment. “In fact, the vast number of these 137 are needs factors and are not used as predictors in the COMPAS risk assessment,” the company said. “The COMPAS risk assessment has six inputs only.”

Risk assessment scores have become an increasingly common feature of the US Justice System, with similar products often used for decisions about pre-trial detainment. Controversially, the specific details of the algorithm are often treated as a trade secret, making it difficult for lawyers to contest the results. Last year, the Supreme Court declined to hear a case challenging the legality of the COMPAS system, which argued that keeping the algorithm secret violated the defendant’s constitutional rights.

Notably, both systems showed roughly the same bias profile as COMPAS, maintaining predictive parity across races but distributing error disproportionately, with false positives more likely to occur among black defendants.

The study’s biggest weakness is the data itself. Court records are notoriously messy, and the data is drawn from just two years in a specific county, which could limit its predictive power. Recidivism studies also face a long-standing problem in reliably measuring false positives, since a longer prison sentence can prevent a person from reoffending while they’re incarcerated.

Still, researchers expect plenty of confirmation studies are already underway. “My guess is that there will be a slew of papers coming…

Mechanical Turkers out-predicted COMPAS, a major judicial algorithm

Caution is indeed warranted, according to Julia Dressel and Hany Farid from Dartmouth College. In a new study, they have shown that COMPAS is no better at predicting an individual’s risk of recidivism than random volunteers recruited from the internet.

“Imagine you’re a judge and your court has purchased this software; the people behind it say they have big data and algorithms, and their software says the defendant is high-risk,” says Farid. “Now imagine I said: Hey, I asked 20 random people online if this person will recidivate and they said yes. How would you weight those two pieces of data? I bet you’d weight them differently. But what we’ve shown should give the courts some pause.” (A spokesperson from Equivant declined a request for an interview.)

COMPAS has attracted controversy before. In 2016, the technology reporter Julia Angwin and colleagues at ProPublica analyzed COMPAS assessments for more than 7,000 arrestees in Broward County, Florida, and published an investigation claiming that the algorithm was biased against African Americans. The problems, they said, lay in the algorithm’s mistakes. “Blacks are almost twice as likely as whites to be labeled a higher risk but not actually re-offend,” the team wrote. And COMPAS “makes the opposite mistake among whites: They are much more likely than blacks to be labeled lower-risk but go on to commit other crimes.”

Northpointe questioned ProPublica’s analysis, as did various academics. They noted, among other rebuttals, that the program correctly predicted recidivism in both white and black defendants at similar rates. For any given score on COMPAS’s 10-point scale, white and black people are just as likely to re-offend as each other. Others have noted that this debate hinges on one’s definition of fairness, and that it’s mathematically impossible to satisfy the standards set by both Northpointe and ProPublica—a story at The Washington Post clearly explains why.

The debate continues, but when Dressel read about it, she realized that it masked a different problem. “There was this underlying assumption in the conversation that the algorithm’s predictions were inherently better than human ones,” she says, “but I couldn’t find any research proving that.” So she and Farid did their own.

They recruited 400 volunteers through a crowdsourcing site. Each person saw short descriptions of defendants from ProPublica’s investigation, highlighting seven pieces of information. Based on that, they had to guess if the defendant would commit another crime within two years.

On average, they got the right answer 63 percent of the time, and the group’s accuracy rose to 67 percent if their answers were pooled. COMPAS, by contrast, has an accuracy of 65 percent. It’s barely better than individual guessers, and no better than a crowd. “These are nonexperts, responding to an online survey with a fraction of the amount of information that the software has,” says Farid. “So what exactly is software like COMPAS doing?”

A Popular Algorithm Is No Better at Predicting Crimes Than Random People

IN AMERICA, computers have been used to assist bail and sentencing decisions for many years. Their proponents argue that the rigorous logic of an algorithm, trained with a vast amount of data, can make judgments about whether a convict will reoffend that are unclouded by human bias. Two researchers have now put one such program, COMPAS, to the test. According to their study, published in Science Advances, COMPAS did neither better nor worse than people with no special expertise.

Julia Dressel and Hany Farid of Dartmouth College in New Hampshire selected 1,000 defendants at random from a database of 7,214 people arrested in Broward County, Florida between 2013 and 2014, who had been subject to COMPAS analysis. They split their sample into 20 groups of 50. For each defendant they created a short description that included sex, age and prior convictions, as well as the criminal charge faced.


They then turned to Amazon Mechanical Turk, a website which recruits volunteers to carry out small tasks in exchange for cash. They asked 400 such volunteers to predict, on the basis of the descriptions, whether a particular defendant would be arrested for another crime within two years of his arraignment (excluding any jail time he might have served)—a fact now known because of the passage of time. Each volunteer saw only one group of 50 people, and each group was seen by 20 volunteers. When Ms Dressel and Dr Farid crunched the numbers, they found that the volunteers correctly predicted whether someone had been rearrested 62.1% of the time. When the judgments of the 20 who examined a particular defendant’s case were pooled, this rose to 67%. COMPAS had scored 65.2%—essentially the same as the human volunteers.
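The "pooling" step is a majority vote over the 20 volunteers who saw each defendant. Below is a toy simulation of that procedure with synthetic, independent raters; because real volunteers' errors are correlated, the simulated gain from pooling is larger than the roughly five-point gain reported in the study.

```python
# Sketch of "pooled" (wisdom-of-the-crowd) accuracy: each defendant is judged
# by 20 raters, and the majority vote is compared with the true outcome.
# Votes here are synthetic; the study used real Mechanical Turk responses.
import numpy as np

rng = np.random.default_rng(1)
n_defendants, n_raters = 1000, 20
truth = rng.random(n_defendants) < 0.45            # actual two-year recidivism
# Each rater is right ~62% of the time, independently (a simplification).
correct = rng.random((n_raters, n_defendants)) < 0.62
votes = np.where(correct, truth, ~truth)

individual_accuracy = np.mean(votes == truth)
pooled = votes.sum(axis=0) > n_raters / 2          # majority vote per defendant
pooled_accuracy = np.mean(pooled == truth)

print(f"individual: {individual_accuracy:.2f}, pooled: {pooled_accuracy:.2f}")
```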

To see whether mention of a person’s race (a thorny issue in the American criminal-justice system) would affect such judgments, Ms Dressel and Dr Farid recruited 400 more volunteers and repeated their experiment, this time adding each defendant’s race to the description. It made no difference. Participants identified those rearrested with 66.5% accuracy.

All this suggests that COMPAS, though not perfect, is indeed as good as human common sense at parsing pertinent facts to predict who will and will not come to the law’s attention again. That is encouraging. Whether it is good value, though, is a different question, for Ms Dressel and Dr Farid have devised an algorithm of their own that was as accurate as COMPAS in predicting rearrest when fed the Broward County data, but which involves only two inputs—the defendant’s age and number of prior convictions.

As Tim Brennan, chief scientist at Equivant, which makes COMPAS, points out, the researchers’ algorithm, having been trained and tested on data from one and the same place, might prove less accurate if faced with records from elsewhere. But so long as the algorithm behind COMPAS itself remains proprietary, a detailed comparison of the virtues of the two is not possible.

Are programs better than people at predicting reoffending?

Algorithms for predicting recidivism are commonly used to assess a criminal defendant’s likelihood of committing a crime. These predictions are used in pretrial, parole, and sentencing decisions. Proponents of these systems argue that big data and advanced machine learning make these analyses more accurate and less biased than humans. We show, however, that the widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise. In addition, despite COMPAS’s collection of 137 features, the same accuracy can be achieved with a simple linear classifier with only two features.

The accuracy, fairness, and limits of predicting recidivism

Program used to assess more than a million US defendants may not be accurate enough for potentially life-changing decisions, say experts

The credibility of a computer program used for bail and sentencing decisions has been called into question after it was found to be no more accurate at predicting the risk of reoffending than people with no criminal justice experience provided with only the defendant’s age, sex and criminal history.

The algorithm, called Compas (Correctional Offender Management Profiling for Alternative Sanctions), is used throughout the US to weigh up whether defendants awaiting trial or sentencing are at too much risk of reoffending to be released on bail.

Since being developed in 1998, the tool is reported to have been used to assess more than one million defendants. But a new paper has cast doubt on whether the software’s predictions are sufficiently accurate to justify its use in potentially life-changing decisions.


Hany Farid, a co-author of the paper and professor of computer science at Dartmouth College in New Hampshire, said: “The cost of being wrong is very high and at this point there’s a serious question over whether it should have any part in these decisions.”

The analysis comes as courts and police forces internationally are increasingly relying on computerised approaches to predict the likelihood of people reoffending and to identify potential crime hotspots where police resources should be concentrated. In the UK, East Midlands police force are trialling software called Valcri, aimed at generating plausible ideas about how, when and why a crime was committed as well as who did it, and Kent Police have been using predictive crime mapping software called PredPol since 2013.

The trend has raised concerns about whether such tools could introduce new forms of bias into the criminal justice system, as well as questions about the regulation of algorithms to ensure the decisions they reach are fair and transparent.

The latest analysis focuses on the more basic question of accuracy.

Farid, with colleague Julia Dressel, compared the ability of the software – which combines 137 measures for each individual – against that of untrained workers, contracted through Amazon’s Mechanical Turk online crowd-sourcing marketplace.

The academics used a database of more than 7,000 pretrial defendants from Broward County, Florida, which included individual demographic information, age, sex, criminal history and arrest record in the two-year period following the Compas scoring.

The online workers were given short descriptions that included a defendant’s sex, age, and previous criminal history and asked whether they thought they would reoffend. Using far less information than Compas (seven variables versus 137), when the results were pooled the humans were accurate in 67% of cases, compared to the 65% accuracy of Compas.

In a second analysis, the paper found that Compas’s accuracy at predicting recidivism could also be matched using a simple calculation involving only an offender’s age and the number of prior convictions.

“When you boil down what the software is actually doing, it comes down to two things: your age and number of prior convictions,” said Farid. “If you are young and have a lot of prior convictions you are high risk.”

“As we peel the curtain away on these proprietary algorithms, the details of which are closely guarded, it doesn’t look that impressive,” he added. “It doesn’t mean we shouldn’t use it, but judges and courts and prosecutors should understand what is behind this.”

Seena Fazel, a professor of forensic psychiatry at the University of Oxford, agreed that the inner workings of such risk assessment tools ought to be made public so that they can be scrutinised.

However, he said that in practice, such algorithms were not used to provide a “yes or no” answer, but were useful in giving gradations of risk and highlighting areas of vulnerability – for instance, recommending that a person be assigned a drug support worker on release from prison.

“I don’t think you can say these algorithms have no value,” he said. “There’s lots of other evidence suggesting they are useful.”


The paper also highlights the potential for racial asymmetries in the outputs of such software that can be difficult to avoid – even if the software itself is unbiased.

The analysis showed that while the accuracy of the software was the same for black and white defendants, the so-called false positive rate (when someone who does not go on to offend is classified as high risk) was higher for black than for white defendants. This kind of asymmetry is mathematically inevitable in the case where two populations have a different underlying rate of reoffending – in the Florida data set the black defendants were more likely to reoffend – but such disparities nonetheless raise thorny questions…
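The mathematical inevitability mentioned above follows from a simple identity linking a group's base rate, the score's positive predictive value, and its false-positive rate. The short sketch below uses illustrative numbers rather than the study's figures.

```python
# Numeric illustration: if a score is equally well calibrated for two groups
# (same PPV and TPR) but the groups have different underlying reoffense rates,
# their false-positive rates cannot be equal. Numbers are illustrative only.
def fpr(base_rate, ppv, tpr):
    # Derived from PPV = p*TPR / (p*TPR + (1 - p)*FPR), solved for FPR.
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * tpr

ppv, tpr = 0.65, 0.6          # held equal across both groups
for group, base_rate in [("group A", 0.5), ("group B", 0.4)]:
    print(group, round(fpr(base_rate, ppv, tpr), 2))   # unequal FPRs
```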

Software 'no more accurate than untrained humans' at judging reoffending risk

Algorithms that assess people’s likelihood to reoffend as part of the bail-setting process in criminal cases are, to be frank, really scary.

We don’t know very much about how they work—the companies that make them are intensely secretive about what makes their products tick—and studies have suggested that they can harbor racial prejudices. Yet, these algorithms provide judges with information that is used to decide the course of somebody’s life.

Now, a new study published on Wednesday in Science Advances from Dartmouth College computer science professor Hany Farid and former student Julia Dressel claims to “cast significant doubt on the entire effort of algorithmic recidivism prediction,” the authors write. In short, bail algorithms don’t appear to perform any better than human beings.

According to their study, COMPAS—one of the most popular algorithms used by courts in the US and elsewhere to predict recidivism—is no more accurate than 20 people asked to estimate the likelihood of recidivism in an online survey. Additionally, COMPAS didn’t outperform a simple linear predictor algorithm armed with just two inputs: age and number of crimes committed. COMPAS, in contrast, uses 137 unique inputs to make decisions, the study's authors write.

In a statement released after the study was published, Equivant—the company behind COMPAS—argued that COMPAS in fact only uses six inputs, and that the rest are "needs factors that are NOT used as predictors in the COMPAS risk assessment." In response, the authors wrote to me in an email that "regardless how many features are used by COMPAS, the fact is that a simple predictor with only two features and people responding to an online survey are as accurate as COMPAS."

“Our point isn’t that it's good or bad,” said co-author Farid over the phone. “But we would like the courts to understand that the weight they give these risk assessments should be based on an understanding that the accuracy from this commercial black box software is exactly the same as asking a bunch of people to respond to an online survey.”

The baseline accuracy of online respondents estimating recidivism within two years was 63 percent, the authors report, while COMPAS’ is 65 percent (a finding based on a dataset covering its use in Broward County, Florida, between 2013 and 2014). The simple linear algorithm with just two inputs had an accuracy of 66 percent. It’s worth noting that many researchers prefer to gauge accuracy with a different statistical measure known as AUC-ROC—even using this measure, though, online survey respondents managed an AUC-ROC value of .71, while COMPAS achieves .70.

"The findings of 'virtually equal predictive accuracy' in this study, instead of being a criticism of the COMPAS assessment," Equivant wrote in an online statement, "actually adds to a growing number of independent studies that have confirmed that COMPAS achieves good predictability and matches the increasingly accepted AUC standard of 0.70 for well-designed risk assessment tools used in criminal justice."

In response, the authors wrote me that .70 AUC is indeed the industry standard, but noted that their study participants nonetheless managed .71. "Therefore, regardless of the preferred measure of predictive performance, COMPAS and the human participants are indistinguishable," they wrote.
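For reference, AUC-ROC is a threshold-free measure: it estimates the probability that a randomly chosen reoffender receives a higher risk score than a randomly chosen non-reoffender. A minimal sketch with synthetic scores, assuming scikit-learn; these are not the study's data.

```python
# Sketch of the AUC-ROC measure mentioned above, on synthetic data.
# AUC is threshold-free: it is the probability that a randomly chosen
# reoffender gets a higher risk score than a randomly chosen non-reoffender.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.random(5000) < 0.45
risk_score = 0.5 * y_true + rng.normal(0, 0.7, 5000)

print(round(roc_auc_score(y_true, risk_score), 2))  # about 0.7 with this noise level
```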

According to the study’s authors, their work suggests a cap on the accuracy of predictions about people’s futures based on historical data, whether the predictions are made by people or machines. Indeed, the whole idea of predicting someone’s behaviour two years from the present may be wrongheaded, Farid said. Regardless, the overall point is that these automated techniques are not any better than humans.

A potential caveat, however: According to Sam Corbett-Davies—a Stanford PhD student who has done research on the risks posed by bail algorithms—predictions based solely on select historical data (whether it’s done by algorithms or not) are often still more accurate than those that include more subjective factors like how a judge feels about tattoos.

"Judges are exposed to much more information: they can talk to defendants, assess their demeanor, see their tattoos, and ask about their upbringing or family life,” Corbett-Davies wrote me in an email. “All these extra factors are mostly useless, but they allow human biases to seep into judges' decisions. Multiple studies have looked at thousands of judge decisions and found that algorithms based on very few factors can significantly outperform judges."

In other words, human "intuition" based on a grab bag of subjective factors may still be less accurate than algorithms (or even humans) just looking at select historical information about a person.

Still, Farid and Dressel’s findings are, at the very least, an indictment of how companies armed with flashy advertising and a staunch refusal to reveal their secret sauce have managed to flood the criminal justice system with algorithms that help to decide people’s futures without publicly vetted evidence of their accuracy.

Indeed, study co-author Julia Dressel told me over the phone, the last published study that specifically compared the accuracy of algorithms versus that of humans in predicting recidivism (that they could find, anyway) was done in Canada in 1984. A few things have changed since then.

“Companies should have to prove that these algorithms are actually accurate and effective,” Dressel told me over the phone. “I think the main step forward is recognizing that we need to be a bit wary of machine learning and artificial intelligence. And though these words sound impressive, and they can do really great things, we have to hold these technologies to a high standard.”

UPDATE: Equivant initially did not respond to Motherboard's request for comment, but after publication released a statement that criticized the study published in Science Advances by Hany Farid and Julia Dressel. The company claimed that the researchers misstated the number of inputs COMPAS uses, and questioned their methodology. We asked Equivant for more details, but it declined. The story has been updated with Equivant's response and additional comments from the authors defending their work.

Bail Algorithms Are As Accurate As Random People Doing an Online Survey

A widely-used computer software tool may be no more accurate or fair at predicting repeat criminal behavior than people with no criminal justice experience, according to a Dartmouth College study.

The Dartmouth analysis showed that non-experts who responded to an online survey performed equally as well as the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) software system used by courts to help determine the risk of recidivism.

The paper also demonstrates that although COMPAS uses over one hundred pieces of information to make a prediction, the same level of accuracy may be achieved with only two variables -- a defendant's age and number of prior convictions.

According to the research paper, COMPAS has been used to assess over one million offenders since it was developed in 1998, with its recidivism prediction component in use since 2000.

The analysis, published in the journal Science Advances, was carried out by the student-faculty research team of Julia Dressel and Hany Farid.

"It is troubling that untrained internet workers can perform as well as a computer program used to make life-altering decisions about criminal defendants," said Farid, the Albert Bradley 1915 Third Century Professor of Computer Science at Dartmouth College. "The use of such software may be doing nothing to help people who could be denied a second chance by black-box algorithms."

According to the paper, software tools are used in pretrial, parole, and sentencing decisions to predict criminal behavior, including who is likely to fail to appear at a court hearing and who is likely to reoffend at some point in the future. Supporters of such systems argue that big data and advanced machine learning make these analyses more accurate and less biased than predictions made by humans.

"Claims that secretive and seemingly sophisticated data tools are more accurate and fair than humans are simply not supported by our research findings," said Dressel, who performed the research as part of her undergraduate thesis in computer science at Dartmouth.

The research paper compares the commercial COMPAS software against workers contracted through Amazon's online Mechanical Turk crowd-sourcing marketplace to see which approach is more accurate and fair when judging the possibility of recidivism. For the purposes of the study, recidivism was defined as committing a misdemeanor or felony within two years of a defendant's last arrest.

Groups of internet workers saw short descriptions that included a defendant's sex, age, and previous criminal history. The human results were then compared to results from the COMPAS system that utilizes 137 variables for each individual.

Overall accuracy was based on the rate at which a defendant was correctly predicted to recidivate or not. The research also reported on false positives -- when a defendant is predicted to recidivate but doesn't -- and false negatives -- when a defendant is predicted not to recidivate but does.

With considerably less information than COMPAS -- seven features compared to 137 -- when results were pooled to determine the "wisdom of the crowd," the humans with no presumed criminal justice experience were accurate in 67 percent of the cases presented, statistically the same as the 65.2 percent accuracy of COMPAS. Study participants and COMPAS were in agreement for 69.2 percent of the 1000 defendants when predicting who would repeat their crimes.

According to the study, the question of accurate prediction of recidivism is not limited to COMPAS. A separate review cited in the study found that eight of nine software programs failed to make accurate predictions.

"The entire use of recidivism prediction instruments in courtrooms should be called into question," Dressel said. "Along with previous work on the fairness of criminal justice algorithms, these combined results cast significant doubt on the entire effort of predicting recidivism."

In contrast to other analyses that focus on whether algorithms are racially biased, the Dartmouth study considers the more fundamental issue of whether the COMPAS algorithm is any better than untrained humans at predicting recidivism in an accurate and fair way.

However, when race was considered, the research found that results from both the human respondents and the software showed significant disparities between how black and white defendants are judged.

According to the paper, it is valuable to ask if we would put these decisions in the hands of untrained people who respond to an online survey, because, in the end, "the results from these two approaches appear to be indistinguishable."

Court software may be no more accurate than web survey takers in predicting criminal risk

The American criminal justice system couldn’t get much less fair. Across the country, some 1.5 million people are locked up in state and federal prisons. More than 600,000 people, the vast majority of whom have yet to be convicted of a crime, sit behind bars in local jails. Black people make up 40 percent of those incarcerated, despite accounting for just 13 percent of the US population.

With the size and cost of jails and prisons rising—not to mention the inherent injustice of the system—cities and states across the country have been lured by tech tools that promise to predict whether someone might commit a crime. These so-called risk assessment algorithms, currently used in states from California to New Jersey, crunch data about a defendant’s history—things like age, gender, and prior convictions—to help courts decide who gets bail, who goes to jail, and who goes free.

But as local governments adopt these tools, and lean on them to inform life-altering decisions, a fundamental question remains: What if these algorithms aren’t actually any better at predicting crime than humans are? What if recidivism isn’t actually that predictable at all?

That’s the question that Dartmouth College researchers Julia Dressel and Hany Farid set out to answer in a new paper published today in the journal Science Advances. They found that one popular risk-assessment algorithm, called Compas, predicts recidivism about as well as a random online poll of people who have no criminal justice training at all.

"There was essentially no difference between people responding to an online survey for a buck and this commercial software being used in the courts," says Farid, who teaches computer science at Dartmouth. "If this software is only as accurate as untrained people responding to an online survey, I think the courts should consider that when trying to decide how much weight to put on them in making decisions."

Man Vs Machine

While she was still a student at Dartmouth majoring in computer science and gender studies, Dressel came across a ProPublica investigation that showed just how biased these algorithms can be. That report analyzed Compas's predictions for some 7,000 defendants in Broward County, Florida, and found that the algorithm was more likely to incorrectly categorize black defendants as having a high risk of reoffending. It was also more likely to incorrectly categorize white defendants as low risk.

That was alarming enough. But Dressel also couldn't seem to find any research that studied whether these algorithms actually improved on human assessments.


"Underlying the whole conversation about algorithms was this assumption that algorithmic prediction was inherently superior to human prediction," she says. But little proof backed up that assumption; this nascent industry is notoriously secretive about developing these models. So Dressel and her professor, Farid, designed an experiment to test Compas on their own.

Using Amazon Mechanical Turk, an online marketplace where people get paid small amounts to complete simple tasks, the researchers asked about 400 participants to decide whether a given defendant was likely to reoffend based on just seven pieces of data, not including that person's race. The sample included 1,000 real defendants from Broward County, because ProPublica had already made its data on those people, as well as information on whether they did in fact reoffend, public.

They divided the participants into groups, so that each turk assessed 50 defendants, and gave the following brief description:

The defendant is a [SEX] aged [AGE]. They have been charged with: [CRIME CHARGE]. This crime is classified as a [CRIMINAL DEGREE]. They have been convicted of [NON-JUVENILE PRIOR COUNT] prior crimes. They have [JUVENILE-FELONY COUNT] juvenile felony charges and [JUVENILE-MISDEMEANOR COUNT] juvenile misdemeanor charges on their record.

That's just seven data points, compared to the 137 that Compas amasses through its defendant questionnaire. In a statement, Equivant says it only uses six of those data points to make its predictions. Still, these untrained online workers were roughly as accurate in their predictions as Compas.
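As an illustration of how such a stimulus can be generated, the sketch below fills the template shown above from a defendant record; the dictionary keys are hypothetical placeholders, not the study's actual variable names.

```python
# Sketch: filling the description template shown above from a defendant record.
# The dictionary keys are hypothetical illustrations, not the study's schema.
TEMPLATE = (
    "The defendant is a {sex} aged {age}. They have been charged with: "
    "{charge}. This crime is classified as a {degree}. They have been "
    "convicted of {priors} prior crimes. They have {juv_felonies} juvenile "
    "felony charges and {juv_misdemeanors} juvenile misdemeanor charges on "
    "their record."
)

defendant = {
    "sex": "male", "age": 27, "charge": "grand theft",
    "degree": "felony", "priors": 3, "juv_felonies": 0, "juv_misdemeanors": 1,
}

print(TEMPLATE.format(**defendant))
```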

Overall, the turks predicted recidivism with 67 percent accuracy, compared to Compas' 65 percent. Even without access to a defendant's race, they also incorrectly predicted that black defendants would reoffend more often than they incorrectly predicted white defendants would reoffend, known as a false positive rate. That indicates that even when racial data isn't available, certain data points—like number of convictions—can become proxies for race, a central issue with eradicating bias in these algorithms. The Dartmouth researchers' false positive rate for black defendants was 37 percent, compared to 27 percent for white defendants.

Crime-Predicting Algorithms May Not Fare Much Better Than Untrained Humans

Just like a professional chef or a heart surgeon, a machine learning algorithm is only as good as the training it receives. And as algorithms increasingly take the reins and make decisions for humans, we're finding out that a lot of them didn't receive the finest education, as they mimic human race- and gender-based biases and even create new problems.

For these reasons, it’s particularly concerning that multiple states, including California, New York, and Wisconsin, use algorithms to predict which people will commit crimes again after they’ve been incarcerated. Even worse, it doesn’t even seem to work.

In a paper published Wednesday in the journal Science Advances, a pair of computer scientists at Dartmouth College found that a widely used computer program for predicting recidivism is no more accurate than completely untrained civilians. This program, called Correctional Offender Management Profiling for Alternative Sanctions, analyzes 137 different factors to determine how likely it is that a person will commit another crime after release. COMPAS considers factors like substance use, social isolation, and other elements that criminologists theorize can lead to recidivism, ranking people as high, medium, or low risk.

Machine learning algorithms that gauge incarcerated people's risk of recidivism are deeply flawed, say researchers.

And sure, risk assessment sounds great. Why not have more data to help courts determine who is a greater risk? But what Dartmouth computer scientists Julia Dressel and Hany Farid found was that untrained individuals correctly judged recidivism risk with just about the same accuracy as COMPAS, suggesting that the supposed power of the algorithm isn’t actually there.

In one trial that included just a fraction of the information used by COMPAS (seven factors instead of 137, and excluding race), a group of human volunteers on the internet, with presumably no training in criminal risk assessment, evaluated case reports. They correctly estimated a person’s recidivism with 67 percent accuracy, compared to COMPAS’s 65 percent accuracy.

Take a moment to let that sink in. Untrained people on the web were slightly better at predicting whether a person would go back to jail than the tool that is literally designed to predict whether a person would go back to jail. And it gets worse. Once you add a defendant's race, the volunteers' false-positive and false-negative rates were within just a few percentage points of COMPAS's. So not only is COMPAS not that great at predicting recidivism, it's just as prone to racial bias as humans are. So much for the cold logic of computers.

The researchers found that humans were almost as good as the algorithm at predicting recidivism rates. They also found that humans and the algorithm had similar rates of false-positives and false-negatives when race is factored in.

The researchers then made a linear model that matched COMPAS’s prediction rate with just two factors: age and number of previous convictions. Just to be clear, this prediction would also be unfair, but it demonstrates just how flawed COMPAS is.
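To make the "two factors" point concrete, here is a minimal sketch of a linear classifier trained on only age and prior-conviction count. The data file and column names are hypothetical placeholders, and the sketch is not the researchers' actual code.

# Hypothetical sketch: a two-feature linear (logistic-regression) predictor.
# The CSV file and column names are placeholders, not the study's data.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("broward_defendants.csv")      # hypothetical file
X = df[["age", "priors_count"]]                 # only two features
y = df["reoffended_within_two_years"]           # 0/1 outcome label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))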

And while this research is new, the big takeaways it espouses are not. In a 2016 investigation, ProPublica reporters found that not only is COMPAS unreliable, it's actually systematically biased against African Americans, consistently rating black people as higher risk than whites who committed more serious crimes. Hopefully, this new research will help pave the way for more just risk assessment processes in the criminal justice system.

The fact that COMPAS is useless at best and deeply biased at worst suggests that computer-based risk assessments could be deepening the injustices that the justice system is supposed to address. Since risk assessment scores can be applied at any step of the criminal justice process, including while setting a person’s bond, determining whether they’re granted parole, and in some states, even for determining a person’s sentence, this research suggests a dire need to reexamine the use of COMPAS and other programs.

Common Computer Program Predicts Recidivism as Poorly as Humans

Predicting Recidivism

Recidivism is the likelihood that a person convicted of a crime will offend again. Currently, this risk is often estimated by predictive algorithms. The outcome can affect everything from sentencing decisions to whether or not a person receives parole.

To determine how accurate these algorithms actually are in practice, a team led by Dartmouth College researchers Julia Dressel and Hany Farid conducted a study of a widely-used commercial risk assessment software known as Correctional Offender Management Profiling for Alternative Sanctions (COMPAS). The software predicts whether or not a person will re-offend within two years following their conviction.

The study revealed that COMPAS is no more accurate than a group of volunteers with no criminal justice experience at predicting recidivism rates. Dressel and Farid crowdsourced a list of volunteers from a website, then randomly assigned them small lists of defendants. The volunteers were told each defendant’s sex, age, and previous criminal history then asked to predict whether they would re-offend within the next two years.

The human volunteers' predictions had a mean accuracy of 62.1 percent and a median of 64.0 percent — very close to COMPAS' accuracy of 65.2 percent.

Additionally, researchers found that even though COMPAS has 137 features, linear predictors with just two features (the defendant’s age and their number of previous convictions) worked just as well for predicting recidivism rates.

The Problem of Bias

One area of concern for the team was the potential for algorithmic bias. In their study, both human volunteers and COMPAS exhibited similar false positive rates when predicting recidivism for black defendants — even though they didn’t know the defendant’s race when they were making their predictions. The false positive rate for black defendants was 37 percent, whereas it was 27 percent for white defendants. These rates were fairly close to those from COMPAS: 40 percent for black defendants and 25 percent for white defendants.
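For readers who want to check figures like these, a false positive rate of this kind is simply FP / (FP + TN) within each group: the share of people who did not reoffend but were predicted to. The sketch below computes it per group from hypothetical prediction and outcome columns; it is not the authors' code.

# Hypothetical sketch: per-group false positive rate, FP / (FP + TN).
# Column names and the CSV file are placeholders.
import pandas as pd

def false_positive_rate(group: pd.DataFrame) -> float:
    did_not_reoffend = group[group["reoffended"] == 0]
    flagged_anyway = (did_not_reoffend["predicted_reoffend"] == 1).sum()
    return flagged_anyway / len(did_not_reoffend)

df = pd.read_csv("predictions.csv")  # hypothetical file
print(df.groupby("race").apply(false_positive_rate))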

In the paper’s discussion, the team pointed out that “differences in the arrest rate of black and white defendants complicate the direct comparison of false-positive and false-negative rates across race.” This is backed up by NAACP data which, for example, has found that “African Americans and whites use drugs at similar rates, but the imprisonment rate of African Americans for drug charges is almost 6 times that of whites.”

The authors noted that even though a person’s race was not explicitly stated, certain aspects of the data could potentially correlate to race, leading to disparities in the results. In fact, when the team repeated the study with new participants and did provide racial data, the results were about the same. The team concluded that “the exclusion of race does not necessarily lead to the elimination of racial disparities in human recidivism prediction.”

Repeated Results

COMPAS has been used to evaluate over 1 million people since it was developed in 1998 (though its recidivism prediction component wasn’t included until 2000). With that context in mind, the study’s findings — that a group of untrained volunteers with little to no experience in criminal justice perform on par with the algorithm — were alarming.

The obvious conclusion would be that the predictive algorithm is simply not sophisticated enough and is long overdue for an update. However, when the team set out to validate their findings, they trained a more powerful nonlinear support vector machine (NL-SVM) on the same data. When it produced very similar results, the team faced the criticism that they had fit the new algorithm too closely to the data.

Dressel and Farid said they specifically trained the algorithm on 80 percent of the data, then ran their tests on the remaining 20 percent in order to avoid so-called "over-fitting" — when an algorithm fits its training data so closely that its accuracy on new data suffers.
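As a rough illustration of that procedure (not the authors' actual pipeline), the sketch below trains a nonlinear, RBF-kernel SVM on 80 percent of a dataset and reports accuracy only on the held-out 20 percent; the file and column names are hypothetical.

# Hypothetical sketch: 80/20 train/test split with a nonlinear (RBF) SVM,
# scored only on held-out data to guard against over-fitting.
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("broward_defendants.csv")      # hypothetical file
X = df[["age", "priors_count", "juv_felony_count", "juv_misd_count"]]
y = df["reoffended_within_two_years"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, svm.predict(X_test)))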

Predictive Algorithms

The researchers concluded that perhaps the data in question is not linearly separable, which could mean that predictive algorithms, no matter how sophisticated, are simply not an effective method for predicting recidivism. Considering that defendants’ futures hang in the balance, the team at Dartmouth asserted that the use of such algorithms to make these determinations should be carefully considered.

As they stated in the study’s discussion, the results of their study show that to rely on an algorithm for that assessment is no different than putting the decision “in the hands of random people who respond to an online survey because, in the end, the results from these two approaches appear to be indistinguishable.”

“Imagine you’re a judge, and you have a commercial piece of software that says we have big data, and it says this person is high risk,” Farid told Wired, “Now imagine I tell you I asked 10 people online the same question, and this is what they said. You’d weigh those things differently.”

Predictive algorithms aren’t just used in the criminal justice system. In fact, we encounter them every day: from products advertised to us online to music recommendations on streaming services. But an ad popping up in our newsfeed is of far less consequence than the decision to convict someone of a crime.

Algorithms Are No Better at Predicting Repeat Offenders Than Inexperienced Humans

In a study published Wednesday, a pair of Dartmouth researchers found that a popular risk assessment algorithm was no better at predicting a criminal offender's likelihood of reoffending than an internet survey of humans with little or no relevant experience.


The study compared the crime-predicting powers of an algorithm called COMPAS, already used by multiple states, to those of Amazon's Mechanical Turk, a sort of micro TaskRabbit where people are paid to complete small assignments. Using an online poll, the researchers asked "turks" to predict recidivism based on a few scant facts about offenders.

Given the sex, age, crime charge, criminal degree, and prior convictions in juvenile, felony and misdemeanour courts of 50 offenders, each of the 400 survey takers had to assess their likelihood of reoffending. The Dartmouth researchers had information on whether the offenders in question actually did reoffend.

In the end, the authors of the study found that the risk assessment algorithm was no more accurate than people without criminal justice experience. From Wired:

Overall, the turks predicted recidivism with 67 per cent accuracy, compared to Compas' 65 per cent. Even without access to a defendant's race, they also incorrectly predicted that black defendants would reoffend more often than they incorrectly predicted white defendants would reoffend, known as a false positive rate. That indicates that even when racial data isn't available, certain data points — like number of convictions — can become proxies for race, a central issue with eradicating bias in these algorithms.

The high number of false positives is telling. Even without knowing a given defendant's race, black defendants were erroneously believed to be more likely to offend more frequently. While it's wildly unethical to explicitly include race as a factor in likelihood of reoffending, race nonetheless colours each data point. Racial segregation, for example, impacts where offenders live and go to school.

If a school is under-served (as are a number of schools in minority neighbourhoods), that impacts students' education level and thus their income and, more broadly speaking, their opportunities in life. There's no variable for race specifically, but it broadly affects each factor that goes into calculating recidivism.

Whether using humans or machines, there's no real way to extricate race from any of the indicators of crime. The problem is when the reality of bias becomes concealed behind an algorithmic veneer of objectivity. We don't expect supposedly impartial machines to repeat human biases and, as a result, those biases become invisible.

"Underlying the whole conversation about algorithms was this assumption that algorithmic prediction was inherently superior to human prediction," Julia Dressel, the paper's co-author, told Wired.

In a statement, the company that makes COMPAS claimed the study only confirmed "the valid performance of the COMPAS risk model," writing, "The findings of 'virtually equal predictive accuracy' in this study, instead of being a criticism of the COMPAS assessment, actually adds to a growing number of independent studies that have confirmed that COMPAS achieves good predictability and matches the increasingly accepted AUC standard of 0.70 for well-designed risk assessment tools used in criminal justice."

So what actually correlates with recidivism? It's startlingly simple: age and prior convictions. Older people were less likely to get into trouble again; younger people, more.

[Wired]

Study Finds Crime-Predicting Algorithm Is No Smarter Than Online Poll Takers


An “unbiased” computer algorithm used for informing judicial decisions appears to be no better than the assessments of a random group of people, according to a recent study. What’s more, the algorithm appears to issue racially biased recommendations.

COMPAS, the software that many judges use to inform their sentencing decisions, tends to classify black people as higher risk and white people as lower risk, despite not including explicit information about race. In practice, this translates into more lenient rehabilitation suggestions for white defendants and more rigorous programs for black defendants of the same recidivism risk.


A study of 1,000 defendants suggests that an algorithm used by judges in several states may be wrong a third of the time.

The results—and the bias—were statistically indistinguishable from judgement calls made by human volunteers randomly selected over the internet.

Here’s John Timmer writing at Ars Technica:

The significance of that discrepancy is still the subject of some debate, but two Dartmouth College researchers have asked a more fundamental question: is the software any good? The answer they came up with is “not especially,” as its performance could be matched by recruiting people on Mechanical Turk or performing a simple analysis that only took two factors into account.

The racial bias likely creeps into the COMPAS algorithm through data on arrest rates, which in some cities and counties are skewed. Equivant, the developer of COMPAS, says it relies on 137 different data points when determining rehabilitation programs, but only six when assessing whether an individual is at risk of reoffending.

The new study builds on an investigation by ProPublica, which analyzed COMPAS's performance in Broward County, Florida, in 2013 and 2014. Researchers at Dartmouth College took data on age, sex, and criminal history for 1,000 defendants and handed it to volunteer "judges" who were recruited over the internet via Amazon's Mechanical Turk service.

COMPAS was no better than the study’s participants in assessing the risk of a defendant reoffending. The study’s authors compared the recommendations made by COMPAS and the participants with real-world data on reoffenders. In practice, more white people who were predicted not to reoffend did (40.3% humans, 47.9% COMPAS) compared with black people (29.2% humans, 30.9% COMPAS). Moreover, a larger proportion of black criminals were wrongly predicted to reoffend (37.1% humans, 40.4% COMPAS) compared with white defendants (27.2% humans, 25.4% COMPAS).

While COMPAS and human judgements were similar, failed judgement calls tended to favor white defendants and disadvantage black defendants. False positives are instances where criminals do not reoffend, but were predicted to. False negatives are instances where criminals were predicted to reform but did not.

The Dartmouth researchers also managed to reproduce the software’s predictions by consulting only 5% of the information the algorithm purportedly considers.

“The widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise,” write the study’s authors. “A simple linear predictor provided with only two features is nearly equivalent to COMPAS with its 137 features.”

Criminal Sentencing Algorithm No More Accurate Than Random People on the Internet

Although crime rates have fallen steadily since the 1990s, rates of recidivism remain a factor in the areas of both public safety and prisoner management. The National Institute of Justice defines recidivism as “criminal acts that resulted in rearrest, re-conviction or return to prison with or without a new sentence,” and with over 75 percent of released prisoners rearrested within five years, it’s apparent there’s room for improvement. In an effort to streamline sentencing, reduce recidivism and increase public safety, private companies have developed criminal justice algorithms for use in the courtroom. These tools — sometimes called “risk assessments” — aim to recommend sentence length and severity specific to each defendant based on a set of proprietary formulae. Unfortunately, the algorithms’ proprietary nature means that neither attorneys nor the general public have access to information necessary to understand or defend against these assessments.

There are dozens of these algorithms currently in use at federal and state levels across the nation. One of the most controversial, the Correctional Offender Management Profiling for Alternative Sanctions or COMPAS, made headlines in 2016 when defendant Eric Loomis received a six-year sentence for reckless endangerment, eluding police, driving a car without the owner's consent, possession of a firearm, probation violation and resisting arrest — a sentence partially based on his COMPAS score. Loomis, a registered sex offender, challenged the verdict, claiming COMPAS violated his constitutional right to due process because he could not mount a proper challenge. His argument was two-fold: that the proprietary nature of the formula denied him and his defense team access to his data, and that COMPAS takes into account race and gender when predicting outcomes, which constitutes bias. His case was denied by the lower court, but Loomis refused to back down, instead appealing to the Wisconsin Supreme Court.

In July of 2016, a unanimous decision by the Wisconsin Supreme Court upheld the state’s decision to use automated programs to determine sentencing. In her opinion, Justice Ann Walsh Bradley wrote: “Although it cannot be determinative, a sentencing court may use a COMPAS risk assessment as a relevant factor for such matters as: (1) diverting low-risk prison-bound offenders to a non-prison alternative; (2) assessing whether an offender can be supervised safely and effectively in the community; and (3) imposing terms and conditions of probation, supervision, and responses to violations.” In response to Loomis’ contention that race and particularly gender can skew results and interfere with due process, Bradley further explained that “considering gender in a COMPAS risk assessment is necessary to achieve statistical accuracy." Her opinion further cautioned that judges should be made aware of potential limitations of risk assessment tools and suggested guidelines for use such as quality control and validation checks on the software as well as user education.

A report from the Electronic Privacy Information Center (EPIC), however, warns that in many cases issues of validity and training are overlooked rather than addressed. To underscore their argument, EPIC, a public interest research center that focuses public attention on emerging privacy and civil liberties issues, compiled a chart matching states with the risk assessment tools used in their sentencing practices. They found more than 30 states that have never run a validation process on the algorithms in use within their state, suggesting that most of the time these programs are used without proper calibration.

The Problem with COMPAS

In states using COMPAS, defendants are asked to fill out a COMPAS questionnaire when they are booked into the criminal justice system. Their answers are analyzed by the proprietary COMPAS software, which generates predictive scores such as “risk of recidivism” and “risk of violent recidivism.”

These scores, calculated by the algorithm on a one-to-10 scale, are shown in a bar chart with 10 representing those most likely to reoffend. Judges receive these charts before sentencing to assist with determinations. COMPAS is not the only element a judge is supposed to consider when determining length and severity of sentence. Past criminal history, the circumstances of the crime (whether there was bodily harm committed or whether the offender was under personal stress) and whether or not the offender exhibits remorse are some examples of mitigating factors affecting sentencing. However, there is no way of telling how much weight a judge assigns to the information received from risk assessment software.

Taken on its own, the COMPAS chart seems like a reasonable, even helpful, bit of information; but the reality is much different. ProPublica conducted an analysis of the COMPAS algorithm and uncovered some valid concerns about the reliability and bias of the software.

In an analysis of over 10,000

Sentence by Numbers: The Scary Truth Behind Risk Assessment Algorithms


Dozens of people packed into a Philadelphia courtroom on June 6th to voice their objections to a proposed criminal justice algorithm. The algorithm, developed by the Pennsylvania Commission on Sentencing, was conceived of as a way to reduce incarceration by predicting the risk that a person would pose a threat to public safety and helping to divert those who are at low risk to alternatives to incarceration.


But many of the speakers worried the tool would instead increase racial disparities in a state where the incarceration rate of black Americans is nine times higher than that of white people. The outpouring of concern at public hearings, as well as from nearly 2,000 people who signed an online petition from the non-profit Color of Change, had a big effect: While the sentencing commission had planned to vote June 14th on whether to adopt the algorithm, members decided to delay the vote for at least six months to consider the objections and to solicit further input.

Algorithms that make predictions about future behavior based on factors such as a person's age and criminal history are increasingly used—and increasingly controversial—in criminal justice decision-making. One of the big objections to the use of such algorithms is that they sometimes operate out of the public's view. For instance, several states have adopted a tool called COMPAS developed by the company Northpointe (now called Equivant), which claims the algorithm is proprietary and refuses to share crucial details of how it calculates scores.


In a striking contrast, the Pennsylvania sentencing commission has been very transparent. Legislation passed in 2010 tasked the commission with developing a risk assessment instrument for use by judges at sentencing "as an aide in evaluating the relative risk that an offender will reoffend and be a threat to public safety," and to help identify candidates for alternatives to incarceration. Since 2010, the commission has released more than 15 reports detailing the development of the algorithm and has held 11 public hearings to invite feedback. The commission has also altered its proposal over time in response to the community's input. For example, the Defender Association of Philadelphia and other organizations argued in 2017 that the use of past arrest record as an input factor would be likely to exacerbate racial bias, and this concern was a factor in the commission's decision to switch to using convictions rather than arrests.

But advocates still have concerns about other elements of the algorithm. For instance, the commission found that predictions as to who was at high risk to be re-arrested for a violent crime (a "crime against a person") had just 18 percent accuracy and so decided not to rely on this information. Instead, the instrument predicts general "recidivism," defined as re-arrest on any misdemeanor or felony charge in Pennsylvania within a three-year period, or for recommitment to the Department of Corrections for a technical violation of parole. (The exception is the few cases where a person is given a low risk score for crime against a person but high risk for general recidivism, in which case both scores would be shown.)

There is widespread agreement that the commission deserves a lot of credit here. It's rare to see designers of a criminal justice algorithm lay out the details, including the design process, in so much detail. (By comparison, basic information such as the inputs used in COMPAS remain out of public view, after a lawsuit in Wisconsin unsuccessfully challenged its use at sentencing as a violation of the right to due process.) Mark Houldin, policy director at the Defender Association of Philadelphia, said: "The Commission's approach of publishing reports on each step is really great, and any jurisdiction that is seeking to adopt a tool like this should do exactly that. If the math, assumptions, and the decisions aren't transparent, there is no way to allow stakeholders and the community to have a say."

Even so, Houldin doesn't think the proposed algorithm should be used at all. Given that the commission found it couldn't predict risk of violent re-arrest with reasonable accuracy, criminal defense attorney Marni Snyder and other members of the Risk Assessment Task Force (formed by Snyder and Democratic state Representative Brian Sims) say that the commission, along with the legislature, should have reconsidered whether to propose the algorithm. Nyssa Taylor, criminal justice policy counsel for the American Civil Liberties Union of Pennsylvania, says that it is also extremely problematic that the definition of recidivism includes technical violations of parole—those in which no new crime is committed, like missing curfew and not notifying an officer of a change of address. She pointed out that Pennsylv

Can Racial Bias Ever Be Removed From Criminal Justice Algorithms?

When Netflix gets a movie recommendation wrong, you’d probably think that it’s not a big deal. Likewise, when your favourite sneakers don’t make it into Amazon’s list of recommended products, it’s probably not the end of the world. But when an algorithm assigns you a threat score from 1 to 500 that is used to rule on jail time, you might have some concerns about this use of predictive analytics.

To the general audience, predictive policing methods are probably best known from the 2002 science fiction movie Minority Report starring Tom Cruise. Based on a short story by Philip K. Dick, the movie presents a vision of the future in which crimes can be predicted and prevented. This may sound like a far-fetched utopian scenario. However, predictive justice already exists today. Built on advanced machine learning systems, there is a wave of new companies that provide predictive services to courts; for example, in the form of risk-assessment algorithms that estimate the likelihood of recidivism for criminals.

Can machines identify future criminals?

After his arrest in 2013, Eric Loomis was sentenced to six years in prison based in part on an opaque algorithmic prediction that he would commit more crimes. Equivant (formerly Northpointe), the company behind the proprietary software used in Eric Loomis’ case, claims to have provided a 360-degree view of the defendant in order to provide detailed algorithmic assistance in judicial decision-making.

This company is one of many players in the predictive justice field in the US. A recent report by the Electronic Privacy Information Center finds that algorithms are increasingly used in court to “set bail, determine sentences, and even contribute to determinations about guilt or innocence”. This shift towards more machine intelligence in courts, allowing AI to augment human judgement, could be extremely beneficial for the judicial system as a whole.

However, an investigative report by ProPublica found that these algorithms tend to reinforce racial bias in law enforcement data. Algorithmic assessments tend to falsely flag black defendants as future criminals at almost twice the rate as white defendants. What is more, the judges who relied on these risk-assessments typically did not understand how the scores were computed.

This is problematic, because machine learning models are only as reliable as the data they’re trained on. If the underlying data is biased in any form, there is a risk that structural inequalities and unfair biases are not just replicated, but also amplified. In this regard, AI engineers must be especially wary of their blind spots and implicit assumptions; it is not just the choice of machine learning techniques that matters, but also all the small decisions about finding, organising and labelling training data for AI models.

Biased data feeds biased algorithms

Even small irregularities and biases can produce a measurable difference in the final risk-assessment. The critical issue is that problems like racial bias and structural discrimination are baked into the world around us.

For instance, there is evidence that, despite similar rates of drug use, black Americans are arrested at four times the rate of white Americans on drug-related charges. Even if engineers were to faithfully collect this data and train a machine learning model with it, the AI would still pick up the embedded bias as part of the model.

Systematic patterns of inequality are everywhere. If you look at the top grossing movies of 2014/2015 you can see that female characters are vastly underrepresented both in terms of screen time and speaking time. New machine learning models can quantify these inequalities, but there are a lot of open questions about how engineers can proactively mitigate them.

Google’s recent “Quick, Draw!” experiment vividly demonstrates why addressing bias matters. The experiment invited internet users worldwide to participate in a fun game of drawing. In every round of the game, users were challenged to draw an object in under 20 seconds. The AI system would then try to guess what their drawing depicts. More than 20 million people from 100 nations participated in the game, resulting in over 2 billion diverse drawings of all sorts of objects, including cats, chairs, postcards, butterflies, skylines, etc.

But when the researchers examined the drawings of shoes in the data-set, they realised that they were dealing with strong cultural bias. A large number of early users drew shoes that looked like Converse sneakers. This led the model to pick up the typical visual attributes of sneakers as the prototypical example of what a “shoe” should look like. Consequently, shoes that did not look like sneakers, such as high heels, ballerinas or clogs, were not recognized as shoes.

Recent studies show that, if left unchecked, machine learning models will learn outdated gender stereotypes, such as “doctors” being male and “receptionists” being female. In a similar fashion, AI models

AI is convicting criminals and determining jail time, but is it fair?

In a study of COMPAS, an algorithmic tool used in the US criminal justice system, Dartmouth College researchers Julia Dressel and Hany Farid found that the algorithm did no better than volunteers recruited via a crowdsourcing site. COMPAS, a proprietary risk assessment algorithm developed by Equivant (formerly Northpointe), considers answers to a 137-item questionnaire in order to provide predictions that are used in making decisions about sentencing and probation. In one case, a defendant claimed that the use of an algorithm whose inner workings were a proprietary secret violated due process.

Prior criticisms, such as ProPublica's 2016 investigation, have led to debates about how to measure fairness. However, none tested the crucial claim that the algorithm provided more accurate predictions than humans would. To test this, Dressel and Farid asked 400 non-expert volunteers to guess whether a defendant would commit another crime within two years, given short descriptions of defendants from ProPublica's investigation in which seven pieces of information were highlighted. On average, the group got the right answer 63% of the time - and 67% of the time if their answers were pooled. COMPAS's accuracy is 65%. Because Equivant does not disclose its algorithm for study, the researchers went on to build their own, making it as simple as possible; it showed an accuracy of 67%, even using only two pieces of information: the defendant's age and number of previous convictions. Other researchers have found similar results.
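The "pooled" figure refers to scoring the crowd's majority vote for each defendant rather than each respondent individually. A minimal sketch of that aggregation, with hypothetical column names, might look like this.

# Hypothetical sketch: mean individual accuracy vs. pooled (majority-vote)
# accuracy for crowd predictions. Column names are placeholders.
import pandas as pd

responses = pd.read_csv("crowd_responses.csv")  # one row per (respondent, defendant)

# Average accuracy of individual respondents.
responses["correct"] = responses["predicted_reoffend"] == responses["reoffended"]
print("mean individual accuracy:", responses["correct"].mean())

# Majority vote per defendant, scored once against the actual outcome.
pooled = responses.groupby("defendant_id").agg(
    vote=("predicted_reoffend", lambda s: int(s.mean() >= 0.5)),
    actual=("reoffended", "first"),
)
print("pooled accuracy:", (pooled["vote"] == pooled["actual"]).mean())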

Farid and Dressel argue that the point is not that these algorithms should not be used, but that they should be understood and required to prove that they work before they are put into use determining the course of people's lives.

https://www.theatlantic.com/technology/archive/2018/01/equivant-compas-algorithm/550646/

tags: criminal justice, COMPAS, algorithms, research, scientific testing, Dartmouth, error rates, prediction

writer: Ed Yong

Publication: The Atlantic

Study finds algorithm no better than random people at predicting recidivism

PRELIMINARY STATEMENT AND STATEMENT OF INTEREST

Independent and adversarial review of software used in the criminal legal system is necessary to protect the courts from unreliable evidence and to ensure that the introduction of new technology does not disadvantage the accused. Though such review has detected outcome-determinative errors in probabilistic genotyping software in the past, TrueAllele has never been subject to such review. Amicus Upturn respectfully requests that this Court grant the defense expert reviewer access to TrueAllele under the terms requested by the defendant. This access is necessary to determine whether TrueAllele is reliable enough to be used in this case and to ensure that the proprietary interests of software developers do not undermine the integrity of the criminal legal system.

Upturn is a nonprofit organization based in Washington, D.C. that seeks to advance equity and justice in the design, governance, and use of technology. Upturn frequently presents its work in the media, before Congress and regulatory agencies, and before the courts in briefs like this one. Upturn has an interest in seeing that forensic technology is not deployed in a way that promotes private interests at the expense of fairness and justice in the criminal legal system.

PROCEDURAL HISTORY AND STATEMENT OF FACTS

Amicus Upturn relies on the procedural history and statement of facts as presented by the defense.

ARGUMENT

I. TrueAllele combines forensic science and software engineering, each of which has its own risks and histories of failure.

Cybergenetics's DNA analysis software, TrueAllele, implements probabilistic genotyping in computer code to attempt forensic identification. One can think of TrueAllele as having three layers, each of which has its own points of failure. The first point of failure of TrueAllele is its complex and novel scientific method—probabilistic genotyping. The second point of failure is the statistical models, developed by Cybergenetics itself, through which TrueAllele carries out the probabilistic genotyping analysis. The third point of failure of TrueAllele is software code, authored by Cybergenetics itself, that implements the probabilistic genotyping algorithms. Failure at any of these points may have harmful, and even fatal, consequences.

A. Flawed forensic science has been used to convict and execute defendants before being subjected to appropriate scrutiny.

The first point of failure for software like TrueAllele is the scientific basis it uses to draw evidentiary conclusions. Here, there are many reasons to be cautious. Numerous evidentiary techniques, initially hailed as groundbreaking and relied on in criminal convictions, have been either found to have significant errors or completely debunked. Arson science used to secure the death sentence of Cameron Todd Willingham was "scientifically proven to be invalid" by both a government commission and an independent review by a panel of fire experts, but only after he had been executed. David Grann, Trial by Fire, The New Yorker (Aug. 31, 2009).1 The resulting national uproar and fundamental reexamination of arson science led to the exoneration of Texas inmate Ed Graf, but only after Graf had already served 26 years in prison. Jeremy Stahl, The Trials of Ed Graf, Slate (Aug. 16, 2015).2 And in 2015, the FBI formally acknowledged flaws in its forensic hair analysis used in thousands of trials spanning a period of over two decades. Spencer S. Hsu, FBI Admits Flaws in Hair Analysis over Decades, Wash. Post (Apr. 18, 2015).3 This flawed analysis was used against thirty-two people who were sentenced to death, fourteen of whom had already been executed or died in prison. Ibid. This history of flawed forensic science underscores that new forensic methods, such as probabilistic genotyping, must be subject to rigorous review to prevent wrongful convictions and executions.

1 https://www.newyorker.com/magazine/2009/09/07/trial-by-fire.
2 http://www.slate.com/articles/news_and_politics/jurisprudence/2015/08/ed_graf_arson_trial_texas_granted_him_a_new_trial_would_modern_forensic.html.
3 https://www.washingtonpost.com/local/crime/fbi-overstatedforensic-hair-matches-in-nearly-all-criminal-trials-fordecades/2015/04/18/39c8d8c6-e515-11e4-b510-962fcfabc310_story.html.

B. Software engineering can independently introduce fatal flaws even when the underlying scientific methods are sound.

Software can allow more efficient and comprehensive data analysis—but it can also be biased, faulty, or completely ineffective. At the design stage, the process of creating software necessarily includes decisions and assumptions. TrueAllele is no exception. It is these differing design decisions that have resulted in variability in conclusions across probabilistic genotyping software. For example, in a New York case TrueAllele and another probabilistic genotyping software produced different conclusions on the defendant's guilt for the same mixed DNA sample. President's Council of Advisors on Science and Technology (PCAST), Report to the President: Forensic Science in Criminal Courts: Ensuring Scientific Validity of Feature-Comparison Methods 78 n.212 (2016) [hereinafter PCAST Report].4 This is not a flaw by itself; Cybergenetics should design their own models and write their own code to implement probabilistic genotyping. In fact, these design and programming choices are the precise reason why TrueAllele's developers want to safeguard their code. However, the defense must have access to information about these design choices because they can influence ostensibly objective results. For example, the Forensic Statistical Tool, a peer to TrueAllele, was found in a 2016 source code review to have a hidden function that tended to overestimate the likelihood of guilt. See Stephanie J. Lacambra et al., Opening the Black Box: Defendants' Rights to Confront Forensic Software, NACDL: The Champion (May 2018). Without independent review of TrueAllele's source code, there is no guarantee that TrueAllele does not have similar outcome-determinative functions that may also lead to wrongful convictions and potentially fatal consequences.

4 https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/PCAST/pcast_forensic_science_report_final.pdf.

Even when software is not designed with faulty assumptions, unintentional errors can significantly impact the software's performance. Just this year, the UK's Most Serious Violence tool, a flagship artificial intelligence system designed to predict future gun and knife violence, was found to have coding flaws that experts concluded made it unusable. Matt Burgess, Police Built an AI to Predict Violent Crime. It Was Seriously Flawed, Wired (Aug. 6, 2020).5 After discovery of a coding error that caused training data to be improperly ingested, the system, originally claimed by its developer to be up to seventy-five percent accurate, was demonstrated to be less than twenty percent accurate. Ibid. And in 2015, investigators in Australia encountered an error in their use of STRmix, a probabilistic genotyping software program intended to resolve mixed DNA profiles similar to TrueAllele. David Murray, Queensland Authorities Confirm 'Miscode' Affects DNA Evidence in Criminal Cases, The Courier Mail (Mar. 20, 2015).6 The error produced incorrect results in at least sixty criminal cases, including a high-profile murder case. Ibid. This is especially concerning given STRmix's striking similarities to TrueAllele—both are forensic identification software systems that use probabilistic genotyping.

5 https://www.wired.co.uk/article/police-violence-predictionndas.
6 https://www.couriermail.com.au/news/queensland/queenslandauthorities-confirm-miscode-affects-dna-evidence-in-criminalcases/news-story/833c580d3f1c59039efd1a2ef55af92b.

C. Allowing companies to shield their software from review increases the risk of undetected failures.

As New Jersey courts have recognized across other contexts, there is no substitute for independent and searching review to find flaws in software that puts people's lives at stake. When such testing is not permitted, the consequences are disastrous. Perhaps the most striking recent example is the failure of the Boeing 737 Max 8 airplanes in 2018 and 2019, which killed 346 people and led to the grounding of over 300 737 Max passenger jets worldwide. Boeing was able to evade independent review—a cautionary tale that shows the consequences of letting financial concerns take priority over human life. The Federal Aviation Administration (FAA) did not thoroughly test Boeing's new Maneuvering Characteristics Augmentation System (MCAS) software because Boeing stated the software was not "safety critical." This software, designed to counteract the weight of new, larger engines, ultimately malfunctioned and led to two crashes. The FAA should have served as an independent inspector, but delegated too much of its responsibility to Boeing itself. See Committee on Transportation and Infrastructure, Final Committee Report: The Design, Development, and Certification of the Boeing 737 Max 57 (Sep. 15, 2020). Boeing was thus able to conceal internal flight simulation testing data that showed pilots took more than twice the time to mitigate an MCAS activation than federal guidelines allow for. Ibid. at 13 & n.66. A technical review found that the FAA was "unable to independently assess the adequacy . . . of MCAS, which was a new and novel feature that should have been closely scrutinized. Had FAA technical staff been fully aware of the details of MCAS, they would have likely identified the potential for the system to overpower other flight controls, which was a major contributing factor leading to the two MAX crashes." Id. at 66–67. Like Boeing, TrueAllele has relied upon self-validation: the main developer of TrueAllele, Mark Perlin, has co-authored the majority of the validation studies done on TrueAllele. In fact, there has never been a complete external review of TrueAllele's source code, nor has there ever been independent and adversarial testing of TrueAllele's software to see how it performs under different conditions. Whether Cybergenetics is aware of flaws in TrueAllele or not, without independent review, defects in TrueAllele may go unidentified just as in Boeing's MCAS.

The lack of independent external review in both cases is enabled by the failure of safeguards meant to prevent such cases. Regulatory capture in the aerospace industry led to Boeing's dedicated FAA reviewers failing to scrutinize the 737 Max thoroughly, in some instances bringing up concerns with Boeing but failing to include those concerns in their report to the FAA itself. Id. at 69–70. In the criminal legal system, rather than a regulatory body, courts and the adversarial process are the safeguards meant to ensure that evidence generated by new technologies is reliable and appropriately used. To uphold its role as a safeguard against wrongful convictions based on questionable evidence, this Court must ensure TrueAllele is thoroughly tested and scrutinized.

This Court should also consider the incentives that Cybergenetics has to shield its technology from review. Perlin has testified that Cybergenetics has invested millions in TrueAllele, and that allowing independent review would pose an unacceptable financial risk. See Decl. of Mark W. Perlin, at ¶ 68, Washington v. Fair, No. 10-1-09274-5 SEA (Sup. Ct. King Cnty. Wash.). Cybergenetics has placed untenable limitations on defense access to TrueAllele's source code, even under a protective order, because its code constitutes trade secrets. But the Boeing example has shown that, when a company is acting based on monetary interests, harmful trade-offs may be made between profit and safety or reliability. In the criminal legal system in particular, Cybergenetics's monetary and proprietary interests in shielding its technology from review should not outweigh the liberty interests at stake for defendants convicted based on TrueAllele-produced evidence.

II. Each aspect of TrueAllele must be subject to independent and adversarial review to ensure its reliability.

In New Jersey, the standard for admitting new scientific evidence in criminal cases centers around the question of "reliability." See State v. Chun, 194 N.J. 54 (2008). For TrueAllele, this question cannot be properly addressed without independent and adversarial review. TrueAllele's first and second layers—the underlying method and statistical models—must be subject to independent validation studies to determine the reliability of the underlying method as well as TrueAllele's specific approach and its limitations. For the third layer—TrueAllele's implementation in software—in addition to testing, direct source code review is necessary to trace how design specifications were implemented and to identify errors.

A. The reliability of TrueAllele's approach to probabilistic genotyping has only been partially addressed through existing validation studies.

Even aside from issues of self-validation, software validation and developmental validation of forensic methods are different. Although validation studies may be able to determine the validity of a scientific method, and perhaps even protect against failures in translating assumptions to software code, they cannot fully guard against either coding or user error. For example, validation studies are performed on specific versions of the software. It is common for errors in coding to be introduced when new versions are released.

The version of the TrueAllele software used in Mr. Pickett's case postdates every one of the validation studies cited in the report prepared by Cybergenetics, as well as those cited in the initial state's brief in favor of the admission of TrueAllele. Da19-21. None of the peer-reviewed studies listed as part of the state's appendix appear to have been performed on the version of the VUIer client (which is responsible for the match statistic) used in this case. Ra454-55. Prior validation studies cannot replace source code review because subsequent source code versions may introduce new errors not present when validation was completed.

B. TrueAllele’s source code has not been independently reviewed.

Independent review of TrueAllele’s source code is a basic,

necessary step to ensuring that TrueAllele is reliable. See

Darrel C. Ince et al., The Case for Open Computer Programs,

Nature (Feb. 22, 2012) (explaining that “anything less than the

release of source programs is intolerable for results that

depend on computation”). Specifically, this level of review is a

necessary condition of ensuring the software is properly

implementing a program’s design specifications and that the code

is devoid of bugs that could affect the software’s output. See

Lacambra et al., at 32 (stating “programmed assumptions . . .

must be reviewed at the source code level for reliability and

accuracy”). The code in TrueAllele has never been scrutinized by

any party outside of Cybergenetics. See Natalie Ram, Innovating

Criminal Justice, 112 Nw. U. L. Rev. 659, 661 (2018) (noting

that “no one outside of Cybergenetics—Perlin’s company—has seen

or examined that source code”). However, adversarial and

independent source code review—particularly when performed by a

defense expert—is a necessary safeguard that prevents

probabilistic genotyping programs from doing serious harm.

Despite its limitations, source code review was able to catch

11

errors in the Forensic Statistical Tool (FST), the

aforementioned probabilistic genotyping program formerly used in

New York. In the course of a murder trial, the court granted a

defense expert full access to the program’s source code. See

Lauren Kirchner, Where Traditional DNA Testing Fails, Algorithms

Take Over, ProPublica (Nov. 4, 2016).7 This analysis produced two

alarming observations. First, the code did not seem to be

implementing the methods and models that were used in FST’s

validation studies. See Jessica Goldthwaite et al., Mixing It

Up: Legal Challenges to Probabilistic Genotyping Programs for

DNA Mixture Analysis, Champion (May 2018) at 12, 15 (noting

“disturbing differences between what FST was initially

advertised to be and what is actually being used in criminal

casework”). Second, there seemed to be coding errors that caused

results to favor the prosecution’s theory of the case. See id.

This is why it is so important that New Jersey, as it has in

the past, compel the release of proprietary source code to

defense experts to prevent the potential damage new, unchecked

technologies can cause. In a move aimed to protect the integrity

of evidence obtained through the Alcotest 7110 breathalyzer, the

Supreme Court of New Jersey compelled the breathalyzer’s maker,

Draeger Safety Diagnostics, to release its source code to

defense experts. See Chun, 194 N.J. 54 (2008). In 2018, New

7 https://www.propublica.org/article/where-traditional-dnatesting-fails-algorithms-take-over.

12

Jersey courts again took action to preserve the integrity of

trial evidence, addressing calibration issues in Draeger

technology used to obtain DWI convictions. See State v. Cassidy,

235 N.J. 482 (2018). Unlike Alcotest, TrueAllele has been

subject to peer-reviewed studies and Cybergenetics allows for

some inspection and review. But the proposed review conditions

are inconsistent with determining reliability. In light of the

gravity of these possible errors, this Court should move

similarly to act as a steward of emerging forensic technologies

and to subject TrueAllele to independent and adversarial review.

III. Admitting TrueAllele as scientific evidence into the New Jersey criminal court system without independent and adversarial review will harm the administration of justice.

In the criminal court system, the "gatekeeping" function of judges works in tandem with later procedural safeguards such as cross-examination and discovery rights to ensure that the accused is adequately protected from questionable evidentiary technology. Thus, requiring independent and adversarial review in the Frye hearing stage is not simply an option, but rather a necessity to preserve the integrity of the court system and the rights of defendants. Frye v. United States, 293 F. 1013 (D.C. Cir. 1923). New Jersey's emphasis on reliability as the standard for admitting new scientific evidence in criminal cases creates the obligation for judges to act as gatekeepers by excluding unreliable scientific evidence from criminal proceedings. New Jersey courts have historically embraced this role, even going so far as to vacate prior convictions based on questionable scientific evidence en masse to preserve the interest of justice. See, e.g., Cassidy, 235 N.J. at 497 (holding that over 20,000 cases relying on potentially unreliable breathalyzer testing needed to be re-opened). The current question before the Court is not whether TrueAllele is scientifically valid, but rather whether, given the evidence in Parts I and II of this brief and the court's considerations of the administration of justice, TrueAllele can be adequately determined to be reliable without independent and adversarial review including full source code access. Upon consideration of the balance between private parties' interests and defendants' rights, it becomes clear that TrueAllele must be thoroughly reviewed, not just rubber stamped.

A. Admitting scientific evidence without independent and adversarial testing incentivizes secrecy and gives undue influence to private, corporate actors.

Although independent and adversarial review is functionally necessary to assess the reliability of new scientific evidence, trade secrets are often invoked to combat attempts at independent and adversarial review. Although often portrayed as protective measures, trade secrets should not be prioritized over considerations of justice. See Rebecca Wexler, Life, Liberty, and Trade Secrets: Intellectual Property in the Criminal Justice System, 70 Stan. L. Rev. 1343, 1358–71 (2018) (noting how trade secret protections have led to increased secrecy and difficulty for defendants throughout the criminal legal system). Corporations, which prioritize profits and competitive advantage, often argue that trade secrets are necessary for business interests. However, the idea that courts cannot protect both criminal defendants and corporate actors is a false dichotomy in light of existing procedural safeguards that can appropriately protect both private interests and the administration of justice. To faithfully conduct a Frye reliability analysis, independent and adversarial review and testing should not be impeded by trade secret protections. Rather, all relevant materials should be available to reviewing experts, with appropriate procedural safeguards in place.

While the doctrine of trade secrecy has sometimes been used reasonably in cases of extreme business need, the tendency of companies today is to "change the traditional function of trade secrecy from protecting against a competitor's misappropriation to a function that impedes public investigation." See Sonia K. Katyal, The Paradox of Source Code Secrecy, 104 Cornell L. Rev. 1183, 1246 (2019). This is particularly inappropriate in a Frye hearing analysis since it creates a "contradictory paradox of source code secrecy: on one hand, companies argue that their methods are sufficiently known and proven to be broadly accepted by the scientific community and yet, on the other hand, companies will go to enormous lengths to keep their source code confidential so as to preclude further investigation." Id. at 1242–43. This tendency of trade secret protections towards advancing secrecy at the expense of crucial analysis like independent and adversarial testing is at odds with the aims of the criminal legal system—a system built upon revelation and truth seeking for the advancement of justice.

The emphasis on business prospects also leads companies to protect their interests through extreme penalties, further discouraging independent review and distorting the end goal of justice. In this case, the prosecution suggested a $1,000,000 liability if any "proprietary materials are improperly handled, negligently or otherwise." Da235. The State's concern for Cybergenetics's business prospects and the large monetary penalty warp the incentives of parties and detract from the central issue of finding and administering justice. Additionally, the State's attempt at imposing financial risk through a large monetary penalty without a clear definition of a breach highlights the undue influence that a corporation like Cybergenetics can have on criminal proceedings.

Asserting a trade secret privilege in order to avoid independent and adversarial review produces the precise injustice that the New Jersey criminal court system seeks to avoid. New Jersey’s law on trade secret privilege in the criminal court system rejects any trade secret privilege that will “tend to conceal fraud or otherwise work injustice.” N.J.S.A. 2A:84A-26. From a procedural standpoint, concealing information in a criminal case produces fundamental injustice because it stifles a defendant’s rights and inhibits the adversarial methodology of the court system. Recognizing this tendency, the United States Supreme Court has said “evidentiary privileges must be construed narrowly because privileges impede the search for the truth.” Pierce County v. Guillen ex rel. Guillen, 537 U.S. 129, 144 (2003). Courts have recognized that trade secrets are not wholly independent concerns, prioritized above all other legal analysis, but rather, considerations that must not take precedence over substantial justice and an overall search for the truth. Therefore, trade secrets should only be invoked when they are absolutely necessary and do not impede the overall goals of the judicial system.

Traditionally, trade secret protections were intended to prevent malicious or accidental disclosures of vital information that could hurt a business prospect. These considerations have little relevance in the context of good-faith independent and adversarial review, aimed at investigating the reliability of the technology itself. Even if a court finds that disclosure is a valid concern, there are better ways to protect proprietary information than blocking source code review. Since source code is routinely produced during discovery in civil cases, “litigators have ready-made tools at their disposal to address the merit of software related disputes while ensuring that source code remains protected and yet disclosed in a litigation dispute.” Katyal, at 1275–76. For example, New Jersey courts can issue protective orders to protect source code from disclosure. Thus, blocking production of key information needed to verify scientific evidence in a criminal court case on the basis of trade secrets is not only a questionable prioritization of property over liberty, but also an unnecessary choice.

B. Allowing trade secrecy to prevent review violates the procedural rights of this defendant and future defendants.

Admitting scientific evidence without independent and adversarial review at the Frye hearing stage can hinder a defendant’s ability to mount a defense and confront the basis of the prosecution’s evidence in the subsequent criminal proceedings. Historically, New Jersey has prioritized defendants’ rights in the context of considering new scientific evidence at every stage of criminal proceedings beginning with the Frye hearing. Although a defendant’s rights are balanced against other interests, New Jersey law has consistently recognized the centrality of a defendant’s potential loss of liberty in this analysis. Since reliability is the underlying evaluation in a New Jersey Frye hearing, access to every piece of information that may inform such a reliability assessment about the new scientific evidence should be considered.

Admitting scientific evidence without independent and adversarial review at the Frye hearing stage may not only allow unreliable evidence but also directly undermine other safeguards in the criminal legal system. Procedural safeguards in later parts of the criminal process afford defendants the opportunity to challenge admitted evidence. See Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 596 (1993) (“Vigorous cross-examination, presentation of contrary evidence, and careful instruction on the burden of proof are the traditional appropriate means of attacking shaky but admissible evidence.”). However, these subsequent safeguards are not always adequate. Defendants can experience substantial difficulty challenging “shaky evidence” when the inner workings of the tools that produced such evidence are not fully known. See State in Interest of A.B., 219 N.J. 542 (2014) (noting that “[a] criminal trial where the defendant does not have access to the raw materials integral to the building of an effective defense is fundamentally unfair” (internal quotations omitted)). For one thing, once evidence has been admitted, the onus is flipped to the defendant to compel discovery with a subpoena and argue the defense’s necessity for this particular portion of information—the exact opposite of what the Daubert court had envisioned. See Katyal, at 1245. Moreover, the same trade secret protections that create opacity during a Frye hearing can continue to prohibit analysis at later stages of the criminal process. Thus, subsequent opportunities to address and attack the unreliability or secrecy of new scientific technology are never guaranteed to a defendant after the Frye hearing stage. Since a lack of independent and adversarial review at the Frye hearing stage cannot be replaced by subsequent procedural safeguards, the gatekeeping function of judges at the Frye hearing stage is fundamental to procuring justice.

The benefits of independent and adversarial review of forensic technologies extend beyond the life of a criminal case as well. Rigorous scrutiny during a Frye hearing can help prevent future litigation if a technology is later proven to have been inaccurate. Not only is this desirable from a judicial economy perspective, but vetting for inaccuracies at an early stage in litigation can also preventively protect against wrongful convictions, preserving both the court’s integrity and future defendants’ rights. In Cassidy, rigorous scrutiny of breathalyzer technology proved the test to be unreliable, causing the re-opening of 20,000 cases. See Cassidy, 235 N.J. at 1. While the Supreme Court of New Jersey’s decision to remedy this injustice regardless of the administrative burden on the system is admirable, adequate scrutiny during earlier stages of the court proceedings could have prevented the need for re-litigation altogether. Catching errors at the first possible opportunity is particularly crucial within the criminal legal system, since re-litigation cannot always remedy the damages caused by admitting inaccurate scientific evidence. For example, in a 2018 study, the National Registry of Exonerations determined that the known false convictions in the United States since 1989 totaled 20,080 years behind bars. See Radley Balko, Report: Wrongful Convictions Have Stolen at Least 20,000 Years from Innocent Defendants, Wash. Post (Sept. 10, 2018).8 Furthermore, in a legal system that has already been scrutinized for its wide racial and economic disparities, issues of fairness are inherently issues of equity as well. See, e.g., Radley Balko, 21 More Studies Showing Racial Disparities in the Criminal Justice System, Wash. Post (Apr. 9, 2019).9 Therefore, early application of independent and adversarial testing at the Frye hearing stage can be beneficial in terms of judicial economy and prevents the perpetuation of future injustice.

8 https://www.washingtonpost.com/news/opinions/wp/2018/09/10/report-wrongful-convictions-have-stolen-at-least-20000-years-from-innocent-defendants/.

CONCLUSION

Novel forensic methods and software used in the criminal legal system and other high-stakes contexts that have not been subject to sufficient review have historically had incredibly harmful consequences. For probabilistic genotyping in particular, STRmix and FST have both been revealed to have outcome-determinative errors. In the case of FST these errors were identified through independent source code review by the defense. While there is also a larger question of whether probabilistic technology should be used in the criminal legal system at all, cf. Emily Berman, Individualized Suspicion in the Age of Data, 105 Iowa L. Rev. 263 (2020), at minimum, the court should utilize its gatekeeping role in the Frye hearing stage to require independent and adversarial review of TrueAllele, including its source code, in the interest of preserving the integrity of the New Jersey criminal legal system.

Amicus Brief in New Jersey v. Pickett

Recidivism risk assessment is the process of determining the likelihood that an accused, convicted, or incarcerated person will reoffend. The process aims to assist in determining the appropriate limitation on the subject's freedom. With innovation in technology, especially in the area of artificial intelligence (AI), recidivism risk assessment tools built on AI are now well developed and widely used in the criminal justice system. Algorithmic tools are increasingly being used in the Canadian criminal justice system in the pre-trial, sentencing, and post-sentencing phases to predict the future criminal behaviour of accused, convicted, or incarcerated persons. The increasing use of AI in recidivism risk assessment raises many legal issues. I will discuss three of these issues in this blog post.

The first issue relates to what I refer to as algorithmic racism. This arises from the use of historical data to train AI risk assessment tools, which tends to perpetuate historical biases that are then replicated in the assessments these tools produce. The second issue relates to the legality of judges' use of AI risk assessments in sentencing decisions. I argue that an AI risk assessment based on data from a general population tends to deprive a convicted person of the right to an individualized sentence based on accurate information. Third, there is the problem of the proprietary nature of the methodology used in AI risk assessment tools. This is especially the case where offenders challenging their criminal sentence seek access to these proprietary trade secrets.

Issues with the use of AI in criminal justice risk assessment

Algorithmic Racism – Old wine in a new bottle

By algorithmic racism, I refer to systemic, race-based bias arising from the use of AI-powered tools to analyse data for decision making, resulting in unfair outcomes for individuals from a particular segment of society distinguished by race. AI depends on big data: vast amounts of data are required to train an AI algorithm so that it can make predictions. Some of the data used to train recidivism risk assessment algorithms are historical data from eras of mass incarceration, biased policing, and biased bail and sentencing regimes characterised by systemic discrimination against particular sections of society.

Canada is not immune from the problems associated with data from biased policing, as is evident from the decades-old “carding” practice of some police departments. Toronto, Edmonton, and Halifax are notorious for this practice, which has been routinely criticized for disproportionately targeting young Black and Indigenous people. Of more serious concern is the fact that some of this data is blind to recent risk-reduction and anti-discrimination reforms aimed at addressing the over-representation of particular segments of society in the criminal justice system. Unlike explicit racism, which is self-evident and obvious, algorithmic racism is not overtly manifest; it is obscured and buried in data. In fact, it is further blurred by a belief system that tends to portray technology as race neutral and colour blind. Algorithmically biased assessments by AI tools (unlike expert evidence) are accepted in the criminal justice system without further examination or cross-examination. This is related to “anchoring”, a term used by behavioural scientists for the cognitive bias that arises from the human tendency to rely on an available piece of data in decision making with little regard (if any) to flaws in that data.

Understanding algorithmic racism requires the appropriate lens to examine its hidden tentacles embedded or obscured in the AI risk assessment technologies used in the criminal justice system. Critical Race Theory (CRT) provides a fitting lens for the study of algorithmic racism. CRT was developed by legal scholars intent on understanding the lived experiences of people of colour in a judicial system that presented itself as objective and race neutral. CRT adopts the notion that racism is endemic in society. According to Devon W. Carbado, in “Critical What What?” (2011) 43:5 Connecticut Law Review 1593 at 1609, CRT challenges the dominant principles that colour-blindness results in race neutrality and that colour consciousness generates racial preferences. CRT sees the notion of colour-blindness as a mechanism that instead blinds people to racist policies, thereby further perpetuating racial inequality.

Bennett Capers noted that the writings that influenced the critical race movement tend to center on recurring themes: that colour-blind laws tend to conceal real inequality in society, that reforms that apparently benefit minorities are only possible when they are in the interest of the white majority, and that race tends to be avoided in the law. CRT scholars are increasingly drawing on research on implicit bias to illustrate these assertions. Examination of these studies and their data tends to unmask the implicit racism buried in laws and social practices that produces unfair outcomes or bias against individuals from particular segments of society characterised by race.

Using CRT to study these AI risk assessment tools and their operation reveals how these new technologies reinforce implicit and explicit bias against minority groups, especially Black and Indigenous offenders, who are disproportionately represented in the Canadian criminal justice system.

“Madea Goes to Jail” – Individualized versus Generalized Sentence

An important issue that goes to the legality of risk assessment tools in criminal sentencing is the use of group-based analytics in sentencing decisions about an individual offender, as opposed to an individualized sentence based on accurate information specific to that offender. Even in a best-case scenario, algorithmic risk assessment tools base their assessments on general factors drawn from people whose backgrounds are similar to the offender's, not on factors specific to the offender. In R. v. Jackson, 2018 ONSC 2527 (CanLII), Justice Nakatsuru of the Ontario Superior Court noted that “[s]entencing is and has always been a very individual process. A judge takes into account the case-specific facts of the offence and the offender to determine a just and fit sentence… The more a sentencing judge truly knows about the offender, the more exact and proportionate the sentence can be.” (para 3 [emphasis added])

Modern recidivism risk assessment tools built on algorithms and big data provide anything but an individualized assessment or prediction of recidivism. At best, they provide predictions based on the average recidivism of a general population of people who share characteristics similar to those of the accused. This process has the inadvertent tendency to perpetuate stereotypes associated with certain groups (e.g., racial minorities). Sentencing judges, as front-line workers in the criminal justice system, have an obligation to ensure that the information they use in their sentencing decisions does not directly or indirectly contribute to negative stereotypes and discrimination. R. v. Ipeelee, 2012 SCC 13 (CanLII) at para 67.
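To make the contrast between group-based and individualized assessment concrete, here is a minimal, purely illustrative Python sketch; the profile fields, groupings, and reoffence rates are hypothetical and are not drawn from COMPAS or any actual tool. It shows that a group-average score attaches to a profile, not a person: two very different individuals with the same recorded characteristics receive the identical score.

from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    age_band: str          # e.g. "18-25" or "26-40" (hypothetical banding)
    prior_convictions: int

# Hypothetical historical reoffence rates for each profile group.
GROUP_REOFFENCE_RATE = {
    Profile("18-25", 2): 0.61,
    Profile("18-25", 0): 0.38,
    Profile("26-40", 0): 0.24,
}

def group_based_risk(profile: Profile) -> float:
    # The "risk score" is just the average reoffence rate of everyone who
    # shares the defendant's recorded characteristics; nothing specific to
    # the individual (employment, treatment, community ties) enters at all.
    return GROUP_REOFFENCE_RATE[profile]

# Two different people with the same recorded profile receive the same score.
print(group_based_risk(Profile("18-25", 2)))  # 0.61
print(group_based_risk(Profile("18-25", 2)))  # 0.61 again, whoever the person is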

The use of recidivism risk assessment tools is very common in the Canadian criminal justice system. Kelly Hannah-Moffat, in “Actuarial Sentencing: An ‘Unsettled’ Proposition”, noted the tendency of lawyers and probation officers to classify individuals who obtain high risk assessment scores as high-risk offenders rather than simply as individuals who share characteristics with average members of that group. She noted that “[I]nstead of being understood as correlations, risk scores are misconstrued in court submissions, pre-sentence reports, and the range of institutional file narratives that ascribe the characteristics of a risk category to the individual.” (at page 12)

Our criminal justice system is founded on the idea that people should be treated as individuals under the law and not as part of a statistic, and this applies even to criminal sentencing. Hence, recidivism risk assessment technologies built on AI and big data have been criticized as depriving the accused of the right to an individualized sentence based on accurate information. (See the US case of State v. Loomis, 881 N.W.2d 749 (Wis. 2016), hereafter Loomis.) This raises a Section 7 Charter issue concerning the constitutionality of assessments made by the technology, especially in relation to the right of a convicted offender to an individualized sentence based on accurate information.

In Loomis, the offender challenged the use of algorithmic risk assessment in his sentencing. He argued that the use of the risk score generated by the COMPAS algorithmic risk assessment tool violated his right to an individualized sentence because the tool relied on information about a broader group to make an inference about his likelihood of recidivism. He argued that any consideration of information about a broader population in the determination of his likelihood of recidivism violated his right to due process. The Loomis court noted the importance of individualized sentencing in the criminal justice system and acknowledged that COMPAS data on recidivism is not individualized, but rather based on data from groups similar to the offender.

Sentencing is a critical aspect of our criminal justice system. The more sentencing judges know about an offender's past, present, and likely future behaviour, including their personal background – historical, social, and cultural – the more exact and proportionate a sentence they are able to craft. While risk scores can effectively complement judges' efforts to craft appropriate sentences, judges should always remember that algorithmic risk scores are just one of many factors in determining an appropriate sentence; appropriate weight should therefore be attached to this factor, alongside the many others, to ensure that the sentence imposed on the offender is as individualized as it can be. Justice Nakatsuru in R. v. Jackson rightly observed that:

A sentence imposed based upon a complex and in-depth knowledge of the person before the court, as they are situated in the past and present reality of their lived experience, will look very different from a sentence imposed upon a cardboard cut-out of an “offender” (at para 103).

At no point in the sentencing process should sentencing judges hesitate to use their discretion to overrule or ignore algorithmic risk scores that seem out of step with the other factors considered in sentencing, especially where such risk scores tend to aggravate rather than mitigate the sentence.

Another problem that may further impair the ability of AI risk assessment tools to produce an individualized risk score arises where tools that are developed and tested on data from a particular group are used on another group that is not homogeneous with the original group. This results in representation bias. Attempts to deploy AI technology on a group that is not effectively represented in the training data used to develop and train the technology will usually produce flawed and inaccurate results. This problem has been evident in facial recognition software. A report in The New York Times shows that the AI facial recognition software on the market today is developed and trained on data from predominantly white males. While the software has achieved 99 percent accuracy in recognizing the faces of white males, this has not been the case for other races or for women. The darker the skin, the more inaccurate and flawed the result – up to a 35% error rate for darker-skinned women.
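The kind of disparity described above is normally exposed by disaggregating a model's error rate by demographic group rather than reporting a single overall accuracy figure. The short Python sketch below is a hypothetical illustration of that evaluation step; the groups and prediction records are invented and do not reproduce the figures reported by The New York Times.

from collections import defaultdict

# (group, true_label, predicted_label) records for a hypothetical classifier.
records = [
    ("lighter-skinned men", 1, 1),
    ("lighter-skinned men", 0, 0),
    ("lighter-skinned men", 1, 1),
    ("darker-skinned women", 1, 0),
    ("darker-skinned women", 0, 1),
    ("darker-skinned women", 1, 1),
]

errors, totals = defaultdict(int), defaultdict(int)
for group, truth, prediction in records:
    totals[group] += 1
    errors[group] += int(truth != prediction)

overall = sum(errors.values()) / sum(totals.values())
print(f"overall error rate: {overall:.0%}")  # a single aggregate figure hides the gap
for group, n in totals.items():
    print(f"{group}: error rate {errors[group] / n:.0%} over {n} samples")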

In Ewert v. Canada, 2018 SCC 30 (CanLII), an Indigenous man brought a Charter challenge against his risk assessment by Correctional Services Canada (CSC). The tools used in the assessment had been developed and tested on a predominantly non-Indigenous population. The Supreme Court ruled that CSC's obligation under s. 24(1) of the Corrections and Conditional Release Act applies to results generated by risk assessment tools. An algorithmic tool developed and trained on data from one predominant cultural group will thus, more likely than not, be cross-culturally variant to some extent when applied to another cultural group that is not represented (or not adequately represented) in the data used to train the tool. Such a tool is unlikely to generate an individualized assessment of the offender and is more likely to produce a flawed assessment of the risk the offender poses.

Proprietary Right versus Charter Right

The methodologies used by AI tools to assess recidivism risk are considered proprietary trade secrets and are not generally available for scrutiny by the court, the accused, or the prosecution. The proprietary rights attached to these tools restrict the ability of the judge, the prosecution, or the accused to access or determine what factors are taken into consideration in the assessment, and how much weight is attached to those factors. This secretive process becomes problematic where offenders challenging an adverse sentence resulting from assessments made by these tools seek access to the proprietary information to prove the arbitrariness of the deprivation of their liberty or to invalidate the sentence resulting from the assessment.

In our criminal justice system, an accused person has Charter rights to personal liberty and procedural fairness both at trial and at sentencing. These rights also arise during incarceration, when decisions affecting the liberty of the offender are being made by correctional officers (e.g., security classification). (See May v. Ferndale Institution, 2005 SCC 82 (CanLII) at para 76, hereafter May v. Ferndale.) The imposition of a criminal sentence requiring incarceration clearly involves a deprivation of the offender's Charter rights. Such deprivation must be in accordance with the law. But what if a convicted offender seeks access to a proprietary trade secret in a commercial AI tool used to produce the recidivism assessment underlying the criminal sentence? This gives rise to a conflict between the proprietary right of a business corporation to its trade secret and the Charter rights of the offender.

In May v. Ferndale, Correctional Services Canada (CSC) had used a computerised risk tool – the Security Reclassification Scale (SRS) – in reviewing the security classification of some inmates from minimum to medium security. The inmates sought access to the scoring matrix used by the computerised SRS tool and were denied access by the CSC. The Supreme Court of Canada ruled that the inmates were clearly entitled to access the SRS scoring matrix, and that the failure to disclose the information constituted a major breach of procedural fairness. According to the court:

The appellants were deprived of information essential to understanding the computerized system which generated their scores. The appellants were not given the formula used to weigh the factors or the documents used for scoring questions and answers. The appellants knew what the factors were, but did not know how values were assigned to them or how those values factored into the generation of the final score. (at para 117 [Emphasis added])

The Supreme Court noted that it was commonsensical that the matrix scores as well as the methodology used in arriving at the security classification should have been made available to the inmates:

As a matter of logic and common sense, the scoring tabulation and methodology associated with the SRS classification score should have been made available. The importance of making that information available stems from the fact that inmates may want to rebut the evidence relied upon for the calculation of the SRS score and security classification. This information may be critical in circumstances where a security classification depends on the weight attributed to one specific factor. (at para 118 [Emphasis added])
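In computational terms, what the inmates were denied is simply the weights and tabulation behind a score. The hypothetical Python sketch below (the factors and weights are invented and do not reproduce the actual SRS) shows why knowing the factor names alone is not enough to rebut a classification: the assigned weights, not the factor list, determine the final score.

# Knowing the factor names is not the same as knowing the weights.
WEIGHTS = {
    "institutional_incidents": 4.0,   # hypothetical weight
    "program_participation": -2.5,    # hypothetical weight (reduces the score)
    "escape_history": 6.0,            # hypothetical weight
}

def classification_score(answers: dict) -> float:
    # Weighted sum of factor values: the weighting, not the factor list,
    # drives the final classification score.
    return sum(WEIGHTS[factor] * value for factor, value in answers.items())

inmate = {"institutional_incidents": 1, "program_participation": 3, "escape_history": 0}
print(classification_score(inmate))  # -3.5 with these invented weights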

Conclusion

Artificial intelligence technologies will continue to revolutionize our justice system. Properly used, they could greatly enhance the efficient and effective administration of our criminal justice system. However, the use of AI in our criminal justice system raises serious and novel legal issues that must be addressed. It is important to study these legal issues with the ultimate objective of developing a framework that mitigates the adverse and discriminatory impacts of these technologies on the rights of accused persons and offenders.

Artificial Intelligence, Algorithmic Racism and the Canadian Criminal Justice System
