Each fall, thousands of runners descend on the Big Apple to run the New York City marathon. They’ve trained hard all year, and give their all on the course. Long after the elite runners have finished, they stream across the finish line in clumps, exhausted at the end of their 26.2-mile journey. In the middle of the pack, as many as eight or 10 runners might cross the finish line in a single second, and nearly 400 in a single minute.
The difference between a time of 4:08:00 and 4:09:00, however, isn’t large enough to be important. It’s the difference between a pace of 9:28 per mile and 9:30 per mile. Given the vagaries of marathon running (the wind, the temperature, the features of the course), it would be unwise to conclude that the runner who crossed the finish line in 4:08:00 is a much better marathoner than the one who finished in 4:09:00.
But the runner with a time of 4:08:00 finished several hundred places ahead of the runner who finished in 4:09:00 — surely that counts for something! Not really, I’d say. We can quantify the difference, both in absolute terms and in relative position, but these differences are not large enough to be meaningful.
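For readers who want to check the marathon arithmetic, here is a minimal sketch. It assumes only the 26.2-mile distance and the two finishing times quoted above; the function names are my own, chosen for illustration.

```python
MARATHON_MILES = 26.2  # distance quoted in the text


def to_seconds(hours, minutes, seconds):
    """Convert a finishing time to total seconds."""
    return hours * 3600 + minutes * 60 + seconds


def pace_per_mile(total_seconds, miles=MARATHON_MILES):
    """Return the average pace as an 'M:SS' string, rounded to the nearest second."""
    sec_per_mile = round(total_seconds / miles)
    return f"{sec_per_mile // 60}:{sec_per_mile % 60:02d}"


fast = to_seconds(4, 8, 0)  # the 4:08:00 finisher
slow = to_seconds(4, 9, 0)  # the 4:09:00 finisher

fast_pace = pace_per_mile(fast)
slow_pace = pace_per_mile(slow)

# How much longer the slower runner took, as a percentage of the faster time.
gap_percent = (slow - fast) / fast * 100

print(fast_pace, slow_pace, f"{gap_percent:.2f}%")
```

The two paces come out two seconds per mile apart, and the slower runner took well under one percent longer, even though roughly 400 runners may have finished in between.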
The same is true of the information in the Teacher Data Reports recently released in New York City. Small differences in the estimated effects of teachers on their students’ achievement can appear to be much larger, because most teachers are about equally successful with the assortment of students they teach in a given year, regardless of whether those students begin the year as low-achievers or high-achievers. A trivial difference can appear much larger than it actually is, because, like the marathoners, many teachers are “crossing the finish line” at about the same time.
Here’s an example drawn from the 2008-09 Teacher Data Reports. (I chose the example because it’s convenient and have no reason to believe it’s unusual.) In 2009, fifth-graders took New York State’s English Language Arts exam, which consisted of 24 multiple-choice test items and three constructed-response items, which were combined to create a raw score ranging from 0 to 31. The raw scores were then converted to scale scores, which were used to classify individual students into four levels of performance, with Level 3 representing grade-level proficiency. The average student in New York City got 25.5 raw score points out of 31, which in New York City’s scheme represented an average proficiency level of 3.29. (Sounds pretty good, right? Of course, this was before the state wised up that being proficient on the test didn’t mean a student was on track to graduate from high school ready for college.)
The logic of the city’s Teacher Data Reports is to estimate Teacher A’s contribution to his or her students’ test scores by comparing how Teacher A’s students actually did on the state test with how students with the same measured characteristics would be expected to do, based on their prior achievement and individual and classroom characteristics. If Teacher A’s students score at the same level as was predicted by a statistical model, Teacher A is claimed to not “add value” to her students. If Teacher B’s students perform better than expected, Teacher B is said to add value. (And poor Teacher C, whose students score lower than they are predicted to do, is subtracting value, I guess. Maybe we should call him Teacher F.) These “value-added” scores are then ranked, and a teacher is assigned a percentile value representing the percentage of other teachers teaching the same grade and subject who scored lower than he or she did.
An “average” teacher, according to this calculation, is one whose value-added score is 0. Of the 1,751 NYC teachers with three or more years of experience who received a value-added rating in fifth-grade English in 2008-09, 84 got a score that rounded to .00. Their percentile ratings, the number that’s getting all of the attention in the traditional and neo-tabloids, range from 53 to 58. Widening the window by .01 in either direction adds another 152 teachers and stretches the percentile range to 48 to 63. What seems to be a small range of value-added scores can span anything from the 48th to the 63rd percentile, because the value-added scores in this range are clumped together.
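A rough sketch of why the clumping matters, using only the counts quoted above. It assumes that the 152 additional teachers are those whose scores round to -.01 or .01, which is my reading of the figures; how the reports actually assign percentiles is not reproduced here.

```python
TOTAL_TEACHERS = 1751               # fifth-grade English teachers rated in 2008-09
AT_ZERO = 84                        # scores that round to .00
WITHIN_ONE_HUNDREDTH = AT_ZERO + 152  # assumed: scores from -.01 to .01


def share_percent(count, total=TOTAL_TEACHERS):
    """Share of all rated teachers, in percent."""
    return count / total * 100


share_zero = share_percent(AT_ZERO)
share_band = share_percent(WITHIN_ONE_HUNDREDTH)

# Percentile spans reported in the text for the same two groups.
span_zero = 58 - 53
span_band = 63 - 48

print(f"score .00:        {share_zero:.1f}% of teachers, {span_zero} percentile points")
print(f"score -.01 to .01: {share_band:.1f}% of teachers, {span_band} percentile points")
```

Nearly 5 percent of teachers share a single rounded score, and about 13 percent sit within .01 of it, which is broadly in line with the 5-point and 15-point percentile spans reported.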
But it’s hard to know whether a shift of .01 in either direction is large or small. How can we tell? Here’s an idea. Suppose that a teacher named Ruiz had 20 students who took the fifth-grade English test in 2009, and they were at the city average of 25.5 out of 31 raw score points on the test. What if half of the students got one more question right on the test? Doesn’t seem like a big stretch, does it? Just like the variation in the conditions on marathon day, half of the students getting one more question correct on a given test on a given day doesn’t seem out of the realm of possibility.
If this were to happen, Ruiz’s value-added score would rise from 0 to .05. And the percentile range associated with a value-added score of .05 is 75 to 77. All of a sudden, an “average” teacher looks pretty good. And this isn’t due to the margin of error! It’s just because many teachers are about equally effective in promoting student achievement, according to the value-added model in use. A relatively small change in student performance shifts a teacher’s location in the value-added distribution by a surprisingly large amount.
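To see how small the underlying change in this hypothetical is, here is a minimal sketch. It makes the simplifying assumption that every one of the 20 students starts exactly at the city average; the step from raw scores to a value-added score of .05 comes from the reports and is not modeled here.

```python
N_STUDENTS = 20      # class size in the hypothetical
CITY_AVERAGE = 25.5  # average raw score, out of 31
MAX_RAW = 31

# Simplifying assumption: every student starts exactly at the city average.
scores = [CITY_AVERAGE] * N_STUDENTS

# Half of the class gets one more question right.
bumped = [s + 1 if i < N_STUDENTS // 2 else s for i, s in enumerate(scores)]

mean_before = sum(scores) / len(scores)
mean_after = sum(bumped) / len(bumped)
shift = mean_after - mean_before
shift_share = shift / MAX_RAW * 100  # shift as a percentage of the raw-score scale

# Percentile ranges quoted in the text for value-added scores of .00 and .05.
percentile_jump = 75 - 58  # smallest possible jump between the two ranges

print(mean_before, mean_after, shift, f"{shift_share:.1f}%", percentile_jump)
```

The class average moves by half a raw-score point, under 2 percent of the 31-point scale, while the teacher's percentile rank climbs by at least 17 points.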
To be sure, this example is based on one year of student test-score data, not multiple years. But that’s what New York State is proposing to rely on in its first year of the new Annual Professional Performance Review process, and it’s what other jurisdictions, such as Washington, D.C., use in their teacher-evaluation systems. And, as with the marathon, the clumping together of teachers is more of an issue in the middle of the distribution than among those in the lead or at the back of the pack. But that’s little consolation to the teachers whose percentile rankings will figure into annual evaluations that will determine whether they’re permitted to continue teaching.
Speaking at Coney Island Feb. 28, Mayor Bloomberg defiantly affirmed the public’s right to know the contents of teachers’ performance evaluations. “Parents have a right to know every bit of information that we can possibly collect about the teacher that’s in front of their kids,” he said.
That statement is utterly ridiculous. There’s no legitimate interest in information about teachers’ private lives if it has no bearing on their professional performance. But here’s something parents do have the right to know: just how fragile value-added measures based on the New York State testing system are. The New York State tests were never intended to be used to rate teachers’ contributions to student learning — and so it’s little wonder they do a pretty poor job of it.
This post also appears on Eye on Education, Aaron Pallas’s Hechinger Report blog.