Incident 239: Algorithmic Teacher Evaluation Program Failed Student Outcome Goals and Allegedly Caused Harm to Teachers

Description: The Gates-Foundation-funded Intensive Partnerships for Effective Teaching initiative's algorithmic program to assess teacher performance reportedly failed to achieve its goals for student outcomes, particularly for minority students, and was criticized for potentially causing harm to teachers.
Alleged: Intensive Partnerships for Effective Teaching developed and deployed an AI system, which harmed students, particularly low-income minority students, and teachers.

Suggested citation format

Dickinson, Ingrid. (2009-09-01) Incident Number 239. In Lam, K. (ed.) Artificial Intelligence Incident Database. Responsible AI Collaborative.

Incident Stats

Incident ID: 239
Report Count
Incident Date
Editors: Khoa Lam


Incident Reports

The Gates Foundation’s big-data experiment wasn’t just a failure. It did real harm.

The Gates Foundation deserves credit for hiring an independent firm to assess its $575 million program to make public-school teachers more effective. Now that the results are in, it needs to be no less open in recognizing just how wasteful — and damaging — the program has been.

The initiative, known as Intensive Partnerships for Effective Teaching, sought to improve education for low-income minority students, in large part by gathering data and using an algorithm to assess teacher performance. It focused on measures such as test scores, the observations of school principals and evaluations from students and parents to determine whether teachers were adding value. The goal: Reward good teachers, get rid of bad ones and narrow the achievement gap.

Laudable as the intention may have been, it didn’t work. As the independent assessment, produced by the Rand Corporation, put it: “The initiative did not achieve its goals for student achievement or graduation,” particularly for low-income minority students. The report, however, stops short of drawing what I see as the more important conclusion: The approach that the Gates program epitomizes has actually done damage. It has unfairly ruined careers, driving teachers out of the profession amid a nationwide shortage. And its flawed use of metrics has undermined science.

The program’s underlying assumption, common in the world of “big data,” is that data is good and more data is better. To that end, genuine efforts were made to gather as much potentially relevant information as possible. As such programs go, this was the best-case scenario.

Still, to a statistician, the problems are apparent. Principals tend to give almost all teachers great scores — a flaw that the Rand report found to be increasingly true in the latest observational frameworks, even though some teachers found them useful. The value-added models used to rate teachers — typically black boxes whose inner workings are kept secret — are known to be little better than random number generators, and the ones used in the Gates program were no exception. The models’ best defense was that the addition of other measures could mitigate their flaws — a terrible recommendation for a supposedly scientific instrument. Those other measures, such as parent and student surveys, are also biased: As every pollster knows, the answer depends on how you frame the question.

Considering the program’s failures — and all the time and money wasted, and the suffering visited upon hard-working educators — the report’s recommendations are surprisingly weak. It even allows for the possibility that trying again or for longer might produce a better result, as if there were no cost to subjecting real, live people to years of experimentation with potentially adverse consequences. So I’ll compensate for the omission by offering some recommendations of my own.

  1. Value-added models (and the related “student growth percentile” models) are statistically weak and should not be used for high-stakes decisions such as the promotion or firing of teachers.

  2. Keeping assessment formulas secret is an awful idea, because it prevents experts from seeing their flaws before they do damage.

  3. Parent surveys are biased and should not be used for high-stakes decisions.

  4. Principal observations can help teachers get better, but can’t identify bad ones. They shouldn’t be used for high-stakes decisions.

  5. Big data simply isn’t capable yet of providing a “scientific audit” of the teaching profession. It might never be.

Let me emphasize that unleashing such experiments on people is the most wasteful possible way to do science. As we introduce artificial intelligence in myriad areas — insurance, credit, human resources, college administration — will we require the people affected to trust the algorithm until, decades later, it proves to be horribly wrong? How many times must we make this mistake before we demand more scientific testing beforehand?

I’m not an entirely disinterested observer. I have a company that offers algorithm testing services. But I got into the business precisely because I wanted to avert disasters like this. It’s not enough to glean some lessons, make adjustments and move on. For the sake of data science, and for the sake of disadvantaged students, it’s crucial that the Gates Foundation recognize publicly how badly it went wrong.

Here's How Not to Improve Public Schools