Dynamics of teacher value added – the limits

In my last post, I showed that culling low-performance teachers can work surprisingly well, even in the presence of noise that’s as large as the signal.

However, that involved two big assumptions: the labor pool of teachers is unlimited with respect to the district’s needs, and there’s no feedback from the evaluation process to teacher quality and retention. Consider the following revised system structure:

In this view, there are several limitations to the idea of firing bad teachers to improve performance:

  • Fired teachers don’t just disappear into a cloud; they go back into the teacher labor pool. This means that, as use of VA evaluation increases, the mean quality of the labor pool goes down, making it harder to replace teachers with new and better ones. This is unambiguously a negative (balancing) feedback loop.
  • The quality of the labor pool could go up through a similar culling process, but it’s not clear that teacher training institutions can deliver 50% more or better candidates, or that teachers rejected for low value added in one district will leave the pool altogether.

Several effects have ambiguous sign – they help (positive/reinforcing feedback) if the measurement system is seen as fair and attractive to good teachers, but they hurt performance otherwise:

  • Increased use of VA changes the voluntary departure rate of teachers from the district, with different effects on good and bad teachers.
  • Increased use of VA changes the ease of hiring good teachers.
  • Increased use of VA attracts more/better teachers to the labor pool, and reduces attrition from the labor pool.

On balance, I’d guess that these are currently inhibiting performance. Value added measurement is widely perceived as noisy and arbitrary, and biased toward standardized learning goals that aren’t all that valuable or fun to teach to.

  • Increasing the rate of departure requires a corresponding increase in the hiring rate, but this is not free, and there’s no guarantee that the labor pool supports it.

There are some additional limiting loops implicit in Out with the bad, in with the good.

Together, I think these effects most likely limit the potential for Value Added hiring/firing decisions to improve performance rather severely, especially given the current resistance to and possible problems with the measurements.

Dynamics of teacher value added

Suppose for the sake of argument that (a) maximizing standardized test scores is what we want teachers to do and (b) Value Added Modeling (VAM) does in fact measure teacher contributions to scores, perhaps with jaw-dropping noise, but at least no systematic bias.

Jaw-dropping noise isn’t as bad as it sounds. Other evaluation methods, like principal evaluations, aren’t necessarily less random, and if anything are more subject to various unknown biases. (Of course, one of those biases might be a desirable preference for learning not captured by standardized tests, but I won’t go there.) Also, other parts of society, like startup businesses, are subjected to jaw-dropping noise via markets, yet the economy still functions.

Further, imagine that we run a district with 1000 teachers, 10% of whom quit in a given year. We can fire teachers at will on the basis of low value added scores. We might not literally fire them; we might just deny them promotions or other benefits, thus encouraging them to leave. We replace teachers by hiring, and get performance given by a standard normal distribution (i.e. performance is an abstract index, ~ N(0,1)). We measure performance each year, with measurement error that’s as large as the variance in performance (i.e., measured VA = true VA + N(0,1)).

Structure of the system described. Note that this is essentially a discrete event simulation. Rather than a stock of teachers, we have an array of 1000 teacher positions, with each teacher represented by a performance score (“True VA”).

With such high noise, does VAM still work? The short answer is yes, if you don’t mind the side effects, and live in an open system.

If teachers depart at random, average performance across the district will be distributed N(0,.03); the large population of teachers smooths the noise inherited from the hiring process. Suppose, on top of that, that we begin to cull the bottom-scoring 5% of teachers each year. 5% doesn’t sound like a lot, but it probably is. For example, you’d have to hold a tenure review (or whatever) every 4 years and cut one in 5 teachers. Natural turnover probably isn’t really as high as 10%, but even so, this policy would imply a 50% increase in hiring to replace the greater outflow. Then suppose we can increase the accuracy of measurement from N(0,1) to N(0,0.5).

What happens to performance? It goes up quite a bit:

In our scenario (red), the true VA of teachers in the district goes up by about .35 standard deviations eventually. Note the eventually: quality is a stock, and it takes time to fill it up to a new equilibrium level. Initially, it’s easy to improve performance, because there’s low-hanging fruit – the bottom 5% of teachers is solidly poor in performance. But as performance improves, there are fewer poor performers, and it’s tougher to replace them with better new hires.

Surprisingly, doubling the accuracy of measurements (green) or making them perfect (gray) doesn’t increase performance much further. On the other hand, if noise exceeds the signal, ~N(0,5), performance is no longer increased much (black):

Extreme noise defeats the selection process, because firing becomes essentially random. There’s no expectation that a randomly-fired teacher can be replaced with a better randomly-hired teacher.

While aggregate performance goes up in spite of a noisy measurement process, the cost is a high chance of erroneously firing teachers, because their measured performance is in the bottom 5%, but their true performance is not. This is akin to the fundamental tradeoff between Type I and Type II errors in statistics. In our scenario (red), the error rate is about 70%, i.e. 70% of teachers fired aren’t truly in the bottom 5%:

This means that, while evaluation errors come out in the wash at the district system level, they fall rather heavily on individuals. It’s not quite as bad as it seems, though. While a high fraction of teachers fired aren’t really in the bottom 5%, they’re still solidly below average. However, as aggregate performance rises, the false-positive firings get worse, and firings increasingly involve teachers near the middle of the population in performance terms:

Next post: why all of this this is limited by feedback.


More NYC teacher VAM mysteries

I can’t resist a dataset. So, now that I have the NYC teacher value added modeling results, I have to keep picking at it.

The 2007-2008 results are in a slightly different format from the later years, but contain roughly the same number of teacher ratings (17,000) and have lots of matching names, so at first glance the data are ok after some formatting. However, it turns out that, unlike 2008-2010, they contain percentile ranks that are nonuniformly distributed (which should be impossible). They also include values of both 0 and 100 (normally, percentiles are reported 1 to 100 or 0 to 99, but not including both endpoints, so that there are 100 rather than 101 bins). <sound of balled up spreadsheet printout ricocheting around inside metal wastebasket>

Nonuniform distribution of percentile ranks for 2007-2008 school year, for 10 subject-grade combinations.

That leaves only two data points: 2008-2009 and 2009-2010. That’s not much to go on for assessing the reliability of teacher ratings, for which you’d like to have lots of repeated observations of the same teachers. Actually, in a sense there are a bit more than two points, because the data includes a multi-year rating, that includes information from intervals prior to the 2008-2009 school year for some teachers.

I’d expect the multi-year rating to behave like a Bayesian update as more data arrives. In other words, the multi-year score at (t) is roughly the multi-year score at (t-1) convolved with the single-year score for (t). If things are approximately normal, this would work like:

  • Prior: multi-year score for year (t-1), distributed N( mu, sigma/sqrt(n) ) – with mu = teacher’s true expected value added, and sigma = measurement and performance variability, incorporating n years of data
  • Data likelihood: single-year score for year (t), ~ N( mu, sigma )
  • Posterior: multi-year score for year (t), ~ N( mu, sigma/sqrt(n+1) )

So, you’d expect that the multi-year score would behave like a SMOOTH, with the estimated value adjusted incrementally toward each new single-year value observed, and the confidence bounds narrowing with sqrt(n) as observations accumulate. You’d also expect that individual years would have similar contributions to the multi-year score, except to the extent that they differ in number of data points (students & classes) and data quality, which is probably not changing much.

However, I can’t verify any of these properties:

Difference of 09-10 score from 08-09 multi-year score vs. update to multi-year score from 08-09 to 09-10. I’d expect this to be roughly diagonal, and not too noisy. However, it appears that there are a significant number of teachers for whom the multi-year score goes down, despite the fact that their annual 09-10 score exceeds their prior 08-09 multi-year score (and vice versa). This also occurs in percentiles. This is 4th grade English, but other subject-grade combinations appear similar.

Plotting single-year scores for 08-09 and 09-10 against the 09-10 multi-year score, it appears that the multi-year score is much better correlated with 09-10, which would seem to indicate that 09-10 has greater leverage on the outcome. Again, his is 4th grade English, but generalizes.

Percentile range (confidence bounds) for multi-year rank in 08-09 vs. 09-10 school year, for teachers in the 40th-59th percentile in 08-09. Ranges mostly shrink, but not by much.

I hesitate to read too much into this, because it’s possible that (a) the FOI datasheets are flawed, (b) I’m misinterpreting the data, which is rather sketchily documented, or (c) in haste, I’ve just made a total hash of this analysis. But if none of those things are true, then it would seem that the properties of this measurement system are not very desirable. It’s just very weird for a teacher’s multi-year score to go up when his single-year score goes down; a possible explanation could be numerical instability of the measurement process. It’s also strange for confidence bounds to widen, or narrow hardly at all, in spite of a large injection of data; that suggests that there’s very little incremental information in each school year. Perhaps one could construct some argument about non-normality of the data that would explain things, but that might violate the assumptions of the estimates. Or, perhaps it’s some artifact of the way scores are normalized. Even if this is a true and proper behavior of the estimate, it gives the measurement system a face validity problem. For the sake of NYC teachers, I hope that it’s (c).

Teacher value added modeling

The vision of teacher value added modeling (VAM) is a good thing: evaluate teachers based on objective measures of their contribution to student performance. It may be a bit utopian, like the cybernetic factory, but I’m generally all for substitution of reason for baser instincts. But a prerequisite for a good control system is a good model connected to adequate data streams. I think there’s reason to question whether we have these yet for teacher VAM.

The VAM models I’ve seen are all similar. Essentially you do a regression on student performance, with a dummy for the teacher, and as many other explanatory variables as you can think of. Teacher performance is what’s left after you control for demographics and whatever else you can think of. (This RAND monograph has a useful summary.)

Right away, you can imagine lots of things going wrong. Statistically, the biggies are omitted variable bias and selection bias (because students aren’t randomly assigned to teachers). You might hope that omitted variables come out in the wash for aggregate measurements, but that’s not much consolation to individual teachers who could suffer career-harming noise. Selection bias is especially troubling, because it doesn’t come out in the wash. You can immediately think of positive-feedback mechanisms that would reinforce the performance of teachers who (by mere luck) perform better initially. There might also be nonlinear interaction affects due to classroom populations that don’t show up as the aggregate of individual student metrics.

On top of the narrow technical issues are some bigger philosophical problems with the measurements. First, they’re just what can be gleaned from standardized testing. That’s a useful data point, but I don’t think I need to elaborate on its limitations. Second, the measurement is a one-year snapshot. That means that no one gets any credit for building foundations that enhance learning beyond a single school year. We all know what kind of decisions come out of economic models when you plug in a discount rate of 100%/yr.

The NYC ed department claims that the models are good:

Q: Is the value-added approach reliable?

A: Our model met recognized standards for validity and reliability. Teachers’ value-added scores were positively correlated with school Progress Report scores and principals’ evaluations of teacher effectiveness. A teacher’s value-added score was highly stable from year to year, and the results for teachers in the top 25 percent and bottom 25 percent were particularly stable.

That’s odd, because independent analysis by Gary Rubinstein of FOI released data indicates that scores are highly unstable. I found that hard to square with the district’s claims about the model, above, so I did my own spot check:

Percentiles are actually not the greatest measure here, because they throw away a lot of information about the distribution. Also, the points are integers and therefore overlap. Here are raw z-scores:

Some things to note here:

  • There is at least some information here.
  • The noise level is very high.
  • There’s no visual evidence of the greater reliability in the tails cited by the district. (Unless they’re talking about percentiles, in which case higher reliability occurs almost automatically, because high ranks can only go down, and ranking shrinks the tails of the distribution.)

The model methodology is documented in a memo. Unfortunately, it’s a typical opaque communication in Greek letters, from one statistician to another. I can wade through it, but I bet most teachers can’t. Worse, it’s rather sketchy on model validation. This isn’t just research, it’s being used for control. It’s risky to put a model in such a high-stakes, high profile role without some stress testing. The evaluation of stability in particular (pg. 21) is unsatisfactory because the authors appear to have reported it at the performance category level rather than the teacher level, when the latter is the actual metric of interest, upon which tenure decisions will be made. Even at the category level, cross-year score correlations are very low (~.2-.3) in English and low (~.4-.6) in math (my spot check results are even lower).

What’s really needed here is a full end-to-end model of the system, starting with a synthetic data generator, replicating the measurement system (the 3-tier regression), and ending with a population model of teachers. That’s almost the only way to know whether VAM as a control strategy is really working for this system, rather than merely exercising noise and bias or triggering perverse side effects. The alternative (which appears to be underway) is the vastly more expensive option of experimenting with real $ and real people, and I bet there isn’t adequate evaluation to assess the outcome properly.

Because it does appear that there’s some information here, and the principle of objective measurement is attractive, VAM is an experiment that should continue. But given the uncertainties and spectacular noise level in the measurements, it should be rolled out much more gradually. It’s bonkers for states to hang 50% of a teacher’s evaluation on this method. It’s quite ironic that states are willing to make pointed personnel decisions on the basis of such sketchy information, when they can’t be moved by more robust climate science.

Really, the thrust here ought to have two prongs. Teacher tenure and weeding out the duds ought to be the smaller of the two. The big one should be to use this information to figure out what makes better teachers and classrooms, and make them.

Why learn calculus?

A young friend asked, why bother learning calculus, other than to get into college?

The answer is that calculus holds the keys to the secrets of the universe. If you don’t at least have an intuition for calculus, you’ll have a harder time building things that work (be they machines or organizations), and you’ll be prey to all kinds of crank theories. Of course, there are lots of other ways to go wrong in life too. Be grumpy. Don’t brush your teeth. Hang out in casinos. Wear white shoes after Labor Day. So, all is not lost if you don’t learn calculus. However, the world is less mystifying if you do.

The amazing thing is, calculus works. A couple of years ago, I found my kids busily engaged in a challenge, using a sheet of tinfoil of some fixed size to make a boat that would float as many marbles as possible. They’d managed to get 20 or 30 afloat so far. I surreptitiously went off and wrote down the equation for the volume of a rectangular prism, subject to the constraint that its area not exceed the size of the foil, and used calculus to maximize. They were flabbergasted when I managed to float over a hundred marbles on my first try.

The secrets of the universe come in two flavors. Mathematically, those are integration and differentiation, which are inverses of one another.

Continue reading “Why learn calculus?”

Cool videos of dynamics

I just discovered the Harvard Natural Sciences Lecture Demonstrations – a catalog of ways to learn and play with science. It’s all fun, but a few of the videos provide nice demonstrations of dynamic phenomena.

Here’s a pretty array of pendulums of different lengths and therefore different natural frequencies:

This is a nice demonstration of how structure (length) causes behavior (period of oscillation). You can also see a variety of interesting behavior patterns, like beats, as the oscillations move in and out of phase with one another.

Synchronized metronomes:

These metronomes move in and out of sync as they’re coupled and uncoupled. This is interesting because it’s a fundamentally nonlinear process. Sync provides a nice account of such things, and there’s a nifty interactive coupled pendulum demo here.

Mousetrap fission:

This is a physical analog of an infection model or the Bass diffusion model. It illustrates shifting loop dominance – initially, positive feedback dominates due to the chain reaction of balls tripping new traps, ejecting more balls. After a while, negative feedback takes over as the number of live traps is depleted, and the reaction slows.

A few parts per million


There’s a persistent rumor that CO2 concentrations are too small to have a noticeable radiative effect on the atmosphere. (It appears here, for example, though mixed with so much other claptrap that it’s hard to wrap your mind around the whole argument – which would probably cause your head to explode due to an excess of self-contradiction anyway.)

To fool the innumerate, one must simply state that CO2 constitutes only about 390 parts per million, or .039%, of the atmosphere. Wow, that’s a really small number! How could it possibly matter? To be really sneaky, you can exploit stock-flow misperceptions by talking only about the annual increment (~2 ppm) rather than the total, which makes things look another 100x smaller (apparently a part of the calculation in Joe Bastardi’s width of a human hair vs. a 1km bridge span).

Anyway, my kids and I got curious about this, so we decided to put 390ppm of food coloring in a glass of water. Our precision in shaving dye pellets wasn’t very good, so we actually ended up with about 450ppm. You can see the result above. It’s very obviously blue, in spite of the tiny dye concentration. We think this is a conservative visual example, because a lot of the tablet mass was apparently a fizzy filler, and the atmosphere is 1000 times less dense than water, but effectively 100,000 times thicker than this glass. However, we don’t know much about the molecular weight or radiative properties of the dye.

This doesn’t prove much about the atmosphere, but it does neatly disprove the notion that an effect is automatically small, just because the numbers involved sound small. If you still doubt this, try ingesting a few nanograms of the toxin infused into the period at the end of this sentence.

How Many Pairs of Rabbits Are Created by One Pair in One Year?

The Fibonacci numbers are often illustrated geometrically, with spirals or square tilings, but the nautilus is not their origin. I recently learned that the sequence was first reported as the solution to a dynamic modeling thought experiment, posed by Leonardo Pisano (Fibonacci) in his 1202 masterpiece, Liber Abaci.

How Many Pairs of Rabbits Are Created by One Pair in One Year?

A certain man had one pair of rabbits together in a certain enclosed place, and one wishes to know how many are created from the pair in one year when it is the nature of them in a single month to bear another pair, and in the second month those born to bear also. Because the abovewritten pair in the first month bore, you will double it; there will be two pairs in one month. One of these, namely the first, bears in the second month, and thus there are in the second month 3 pairs; of these in one month two are pregnant, and in the third month 2 pairs of rabbits are born, and thus there are 5 pairs in the month; in this month 3 pairs are pregnant, and in the fourth month there are 8 pairs, of which 5 pairs bear another 5 pairs; these are added to the 8 pairs making 13 pairs in the fifth month; these 5 pairs that are born in this month do not mate in this month, but another 8 pairs are pregnant, and thus there are in the sixth month 21 pairs; [p284] to these are added the 13 pairs that are born in the seventh month; there will be 34 pairs in this month; to this are added the 21 pairs that are born in the eighth month; there will be 55 pairs in this month; to these are added the 34 pairs that are born in the ninth month; there will be 89 pairs in this month; to these are added again the 55 pairs that are born in the tenth month; there will be 144 pairs in this month; to these are added again the 89 pairs that are born in the eleventh month; there will be 233 pairs in this month.

Source: http://www.math.utah.edu/~beebe/software/java/fibonacci/liber-abaci.html

The solution is the famous Fibonacci sequence, which can be written as a recurrent series,

F(n) = F(n-1)+F(n-2), F(0)=F(1)=1

This can be directly implemented as a discrete time Vensim model:

Fibonacci SeriesHowever, that representation is a little too abstract to immediately reveal the connection to rabbits. Instead, I prefer to revert to Fibonacci’s problem description to construct an operational representation:

Fibonacci Rabbits

Mature rabbit pairs are held in a stock (Fibonacci’s “certain enclosed space”), and they breed a new pair each month (i.e. the Reproduction Rate = 1/month). Modeling male-female pairs rather than individual rabbits neatly sidesteps concern over the gender mix. Importantly, there’s a one-month delay between birth and breeding (“in the second month those born to bear also”). That delay is captured by the Immature Pairs stock. Rabbits live forever in this thought experiment, so there’s no outflow from mature pairs.

You can see the relationship between the series and the stock-flow structure if you write down the discrete time representation of the model, ignoring units and assuming that the TIME STEP = Reproduction Rate = Maturation Time = 1:

Mature Pairs(t) = Mature Pairs(t-1) + Maturing
Immature Pairs(t) = Immature Pairs(t-1) + Reproducing - Maturing

Substituting Maturing = Immature Pairs and Reproducing = Mature Pairs,

Mature Pairs(t) = Mature Pairs(t-1) + Immature Pairs(t-1)
Immature Pairs(t) = Immature Pairs(t-1) + Mature Pairs(t-1) - Immature Pairs(t-1) = Mature Pairs(t-1)


Mature Pairs(t) = Mature Pairs(t-1) + Mature Pairs(t-2)

The resulting model has two feedback loops: a minor negative loop governing the Maturing of Immature Pairs, and a positive loop of rabbits Reproducing. The rabbit population tends to explode, due to the positive loop:

Fibonacci Growth

In four years, there are about as many rabbits as there are humans on earth, so that “certain enclosed space” better be big. After an initial transient, the growth rate quickly settles down:

Fibonacci Growth RateIts steady-state value is .61803… (61.8%/month), which is the Golden Ratio conjugate. If you change the variable names, you can see the relationship to the tiling interpretation and the Golden Ratio:

Fibonacci Part Whole

Like anything that grows exponentially, the Fibonacci numbers get big fast. The hundredth is  354,224,848,179,261,915,075.

As before, we can play the eigenvector trick to suppress the growth mode. The system is described by the matrix:

-1 1
 1 0

which has eigenvalues {-1.618033988749895, 0.6180339887498949} – notice the appearance of the Golden Ratio. If we initialize the model with the eigenvector of the negative eigenvalue, {-0.8506508083520399, 0.5257311121191336}, we can get the bunny population under control, at least until numerical noise excites the growth mode, near time 25:

Fibonacci Stable

The problem is that we need negarabbits to do it, -.850653 immature rabbits initially, so this is not a physically realizable solution (which probably guarantees that it will soon be introduced in legislation).

I brought this up with my kids, and they immediately went to the physics of the problem: “Rabbits don’t live forever. How big is your cage? Do you have rabbit food? TONS of rabbit food? What if you have all males, or varying mixtures of males and females?”

It’s easy to generalize the structure to generate other sequences. For example, assuming that mature rabbits live for only two months yields the Padovan sequence. Its equivalent of the Golden Ratio is 1.3247…, i.e. the rabbit population grows more slowly at ~32%/month, as you’d expect since rabbit lives are shorter.

The model’s in my library.