Associated Incidents
Late last week and over the weekend, New York City newspapers, including the New York Times and Wall Street Journal, published the value-added scores (teacher data reports) for thousands of the city’s teachers. Prior to this release, I and others argued that the newspapers should present margins of error along with the estimates. To their credit, both papers did so.
In the Times’ version, for example, each individual teacher’s value-added score (converted to a percentile rank) is presented graphically, for math and reading, in both 2010 and over a teacher’s “career” (averaged across previous years), along with the margins of error. In addition, both papers provided descriptions and warnings about the imprecision in the results. So, while the decision to publish was still, in my personal view, a terrible mistake, the papers at least make a good faith attempt to highlight the imprecision.
That said, they also published data from the city that use teachers’ value-added scores to label them as one of five categories: low, below average, average, above average or high. The Times did this only at the school level (i.e., the percent of each school’s teachers that are “above average” or “high”), while the Journal actually labeled each individual teacher. Presumably, most people who view the databases, particularly the Journal's, will rely heavily on these categorical ratings, as they are easier to understand than percentile ranks surrounded by error margins. The inherent problems with these ratings are what I’d like to discuss, as they illustrate important concepts about estimation error and what can be done about it.
First, let’s quickly summarize the imprecision associated with the NYC value-added scores, using the raw datasets from the city. It has been heavily reported that the average confidence interval for these estimates – the range within which we can be confident the “true estimate” falls - is 35 percentile points in math and 53 in English Language Arts (ELA). But this oversimplifies the situation somewhat, as the overall average masks quite a bit of variation by data availability. Take a look at the graph below, which shows how the average confidence interval varies by the number of years of data available, which is really just a proxy for sample size (see the figure's notes).
When you’re looking at the single-year teacher estimates (in this case, for 2009-10), the average spread is a pretty striking 46 percentile points in math and 62 in ELA. Furthermore, even with five years of data, the intervals are still quite large – about 30 points in math and 48 in ELA. (There is, however, quite a bit of improvement with additional years. The ranges are reduced by around 25 percent in both subjects when you use five years data compared with one.)
Now, opponents of value-added have expressed a great deal of outrage about these high levels of imprecision, and they are indeed extremely wide – which is one major reason why these estimates have absolutely no business being published in an online database. But, as I’ve discussed before, one major, frequently-ignored point about the error – whether in a newspaper or an evaluation system – is that the problem lies less with how much there is than how you go about addressing it.
It’s true that, even with multiple years of data, the estimates are still very imprecise. But, no matter how much data you have, if you pay attention to the error margins, you can, at least to some degree, use this information to ensure that you’re drawing defensible conclusions based on the available information. If you don’t, you can't.
This can be illustrated by taking a look at the categories that the city (and the Journal) uses to label teachers (or, in the case of the Times, schools).
Here’s how teachers are rated: low (0-4th percentile); below average (5-24); average (25-74); above average (75-94); and high (95-99).
To understand the rocky relationship between value-added margins of error and these categories, first take a look at the Times’ “sample graph” below.
This is supposed to be a sample of one teacher’s results. This particular hypothetical teacher’s value-added score was at the 50th percentile, with a margin of error of plus or minus roughly 30 percentile points. What this tells you is that we can have a high level of confidence that this teacher’s “true estimate” is somewhere between the 20th and 80th percentile (that’s the confidence interval for this teacher), although it is more likely to be closer to 50 than 20 or 80.
One shorthand way to see whether teachers scores are, accounting for error, average, above average or below average is to see whether their confidence intervals overlap with the average (50th percentile, which is actually the median, but that's a semantic point in these data).
Let’s say we have a teacher with a value-added score at the 60th percentile, plus or minus 20 points, making for a confidence interval of 40-80. This crosses the average/median “border," si