Teacher value added modeling

The vision of teacher value added modeling (VAM) is a good thing: evaluate teachers based on objective measures of their contribution to student performance. It may be a bit utopian, like the cybernetic factory, but I’m generally all for substitution of reason for baser instincts. But a prerequisite for a good control system is a good model connected to adequate data streams. I think there’s reason to question whether we have these yet for teacher VAM.

The VAM models I’ve seen are all similar. Essentially, you regress student performance on a dummy variable for the teacher plus as many other explanatory variables as you can think of. Teacher performance is whatever is left over after you control for demographics and the other observables. (This RAND monograph has a useful summary.)
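For concreteness, here’s a minimal sketch of that kind of regression in Python (statsmodels), with a hypothetical data file and hypothetical column names; real implementations use multilevel models, multiple years of data, and shrinkage, but the skeleton is the same:

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per student, with a current score, a prior score,
# demographic controls (frl, ell), and a teacher identifier.
df = pd.read_csv("student_scores.csv")

# Teacher "value added" is whatever the teacher dummies pick up after controlling
# for prior achievement and demographics.
fit = smf.ols("score ~ prior_score + frl + ell + C(teacher)", data=df).fit()
teacher_effects = fit.params.filter(like="C(teacher)")
print(teacher_effects.sort_values())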

Right away, you can imagine lots of things going wrong. Statistically, the biggies are omitted variable bias and selection bias (because students aren’t randomly assigned to teachers). You might hope that omitted variables come out in the wash for aggregate measurements, but that’s not much consolation to individual teachers who could suffer career-harming noise. Selection bias is especially troubling, because it doesn’t come out in the wash. You can immediately think of positive-feedback mechanisms that would reinforce the performance of teachers who (by mere luck) perform better initially. There might also be nonlinear interaction effects from classroom composition that don’t show up in the aggregate of individual student metrics.

On top of the narrow technical issues are some bigger philosophical problems with the measurements. First, they’re just what can be gleaned from standardized testing. That’s a useful data point, but I don’t think I need to elaborate on its limitations. Second, the measurement is a one-year snapshot. That means that no one gets any credit for building foundations that enhance learning beyond a single school year. We all know what kind of decisions come out of economic models when you plug in a discount rate of 100%/yr.

The NYC ed department claims that the models are good:

Q: Is the value-added approach reliable?

A: Our model met recognized standards for validity and reliability. Teachers’ value-added scores were positively correlated with school Progress Report scores and principals’ evaluations of teacher effectiveness. A teacher’s value-added score was highly stable from year to year, and the results for teachers in the top 25 percent and bottom 25 percent were particularly stable.

That’s odd, because independent analysis by Gary Rubinstein of data released under FOI indicates that scores are highly unstable. I found that hard to square with the district’s claims about the model above, so I did my own spot check:

Percentiles are actually not the greatest measure here, because they throw away a lot of information about the distribution. Also, the points are integers and therefore overlap. Here are raw z-scores:

Some things to note here:

  • There is at least some information here.
  • The noise level is very high.
  • There’s no visual evidence of the greater reliability in the tails cited by the district. (Unless they’re talking about percentiles, in which case higher reliability occurs almost automatically, because high ranks can only go down, and ranking shrinks the tails of the distribution.)

The model methodology is documented in a memo. Unfortunately, it’s a typical opaque communication in Greek letters, from one statistician to another. I can wade through it, but I bet most teachers can’t. Worse, it’s rather sketchy on model validation. This isn’t just research; it’s being used for control. It’s risky to put a model in such a high-stakes, high-profile role without some stress testing. The evaluation of stability in particular (p. 21) is unsatisfactory, because the authors appear to have reported it at the performance-category level rather than the teacher level, when the latter is the actual metric of interest, upon which tenure decisions will be made. Even at the category level, cross-year score correlations are very low (~.2-.3) in English and low (~.4-.6) in math (my spot check results are even lower).
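Replicating that kind of teacher-level stability check is straightforward. A sketch, assuming the released scores have been wrangled into one row per teacher with z-scores for two adjacent years (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("teacher_scores.csv")  # hypothetical columns: teacher_id, z_2009, z_2010

# Year-to-year correlation at the teacher level -- the quantity tenure decisions ride on.
r = df["z_2009"].corr(df["z_2010"])
print(f"teacher-level cross-year correlation: {r:.2f}")

# Stability of the tails: what fraction of top-quartile teachers in year 1
# are still in the top quartile in year 2? (25% would be pure chance.)
q1 = pd.qcut(df["z_2009"], 4, labels=False)
q2 = pd.qcut(df["z_2010"], 4, labels=False)
print("top-quartile persistence:", ((q1 == 3) & (q2 == 3)).sum() / (q1 == 3).sum())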

What’s really needed here is a full end-to-end model of the system, starting with a synthetic data generator, replicating the measurement system (the 3-tier regression), and ending with a population model of teachers. That’s almost the only way to know whether VAM as a control strategy is really working for this system, rather than merely exercising noise and bias or triggering perverse side effects. The alternative (which appears to be underway) is the vastly more expensive option of experimenting with real $ and real people, and I bet there isn’t adequate evaluation to assess the outcome properly.
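In miniature, such a test harness might look like the sketch below: generate synthetic students with known teacher effects, run the same style of regression on the synthetic data, and check how well the estimates recover the truth. The parameters are invented for illustration; the point is the structure of the test, not the numbers.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, class_size = 50, 25
true_effect = rng.normal(0, 0.1, n_teachers)   # "true" teacher value added (illustrative)

rows = []
for t in range(n_teachers):
    ability = rng.normal(0, 1, class_size)              # unobserved student ability
    prior = ability + rng.normal(0, 0.5, class_size)    # noisy prior-year score
    score = ability + true_effect[t] + rng.normal(0, 0.5, class_size)
    rows.append(pd.DataFrame({"teacher": t, "prior": prior, "score": score}))
df = pd.concat(rows, ignore_index=True)

# The "measurement system": the same style of regression, applied to synthetic data.
fit = smf.ols("score ~ prior + C(teacher)", data=df).fit()
est = np.array([fit.params.get(f"C(teacher)[T.{t}]", 0.0) for t in range(n_teachers)])

# How well do the estimates recover the truth? (Try non-random assignment, smaller
# classes, or more noise to see how quickly this degrades.)
print("correlation(true, estimated):", round(np.corrcoef(true_effect, est)[0, 1], 2))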

Because it does appear that there’s some information here, and the principle of objective measurement is attractive, VAM is an experiment that should continue. But given the uncertainties and spectacular noise level in the measurements, it should be rolled out much more gradually. It’s bonkers for states to hang 50% of a teacher’s evaluation on this method. It’s quite ironic that states are willing to make pointed personnel decisions on the basis of such sketchy information, when they can’t be moved by more robust climate science.

Really, the thrust here ought to have two prongs. Teacher tenure and weeding out the duds ought to be the smaller of the two. The big one should be to use this information to figure out what makes better teachers and classrooms, and make them.

Alternate perceptions of time

An interesting tidbit from Science:

Where Time Goes Up and Down

Dennis Normile

In Western cultures, the future lies ahead; the past is behind us. These notions are embedded in both gestures and spoken metaphors (looking forward to next year or back over the past year). A forward hand motion typically accompanies talk of the future; references to the past often bring a wave over the shoulder.

It is hard for most Westerners to conceive of other ways of conceptualizing time. But in 2006, Rafael Núñez, a cognitive scientist at the University of California, San Diego, reported that for the Aymara, an ethnic group of about 2 million people living in the Andean highlands, in both spoken and gestural terms, the future is unseen and conceived as being behind the speaker; the past, since it has been witnessed, is in front. They point behind themselves when discussing the future. And when talking about the past, Aymara gesture farther in front of them the more distant the event ….

At the Tokyo Evolutionary Linguistics Forum, Núñez presented another example of unusual thinking—and gesturing—about time: The Yupno people, who inhabit a remote valley in Papua New Guinea, think of time topographically. No matter which way a speaker is facing, he or she will gesture uphill when discussing the future and point downhill when talking about the past. …

I like the Aymara approach, with the future unseen behind the speaker. I bet there aren’t any Aymara economic models assuming perfect foresight as a model of behavior.


Doing quality simulation research

Unless the Journal of Irreproducible Results is your target, you should check out this paper:

Rahmandad, H., & Sterman, J. (forthcoming). Reporting Guidelines for Simulation-based Research in Social Sciences.

Abstract: Reproducibility of research is critical for the healthy growth and accumulation of reliable knowledge, and simulation-based research is no exception. However, studies show many simulation-based studies in the social sciences are not reproducible. Better standards for documenting simulation models and reporting results are needed to enhance the reproducibility of simulation-based research in the social sciences. We provide an initial set of Reporting Guidelines for Simulation-based Research (RGSR) in the social sciences, with a focus on common scenarios in system dynamics research. We discuss these guidelines separately for reporting models, reporting simulation experiments, and reporting optimization results. The guidelines are further divided into minimum and preferred requirements, distinguishing between factors that are indispensable for reproduction of research and those that enhance transparency. We also provide a few guidelines for improved visualization of research to reduce the costs of reproduction. Suggestions for enhancing the adoption of these guidelines are discussed at the end.

I should add that this advice isn’t just for the social sciences, nor just for research. Business and public policy models developed by consultants should be no less replicable, even if they remain secret. This is not only a matter of intellectual honesty; it’s a matter of productivity (documented components are easier to reuse) and learning (if you don’t keep track of what you do, you can’t identify and learn from mistakes when reality evolves away from your predictions).

This reminds me that I forgot to plug my annual advice on good writing for the SD conference:

I’m happy to report that the quality of papers in the thread I see was higher than usual (or at least the variance was lower – no plenary blockbuster, but also no dreadful, innumerate, ungrammatical horrors to wade through).

The vicious cycle of ignorance

XKCD:

XKCD - Forgot Algebra

Here’s my quick take on the feedback structure behind this:

Knowledge of X (with X = algebra, cooking, …) is at the heart of a nest of positive feedback loops that make learning about X subject to vicious or virtuous cycles.

  • The more you know about X, the more you find opportunities to use it, and vice versa. If you don’t know calculus, you tend to self-select out of engineering careers, thus fulfilling the “I’ll never use it” prophecy. Through use, you learn by doing, and gain further knowledge.
  • Similarly, the more use you get out of X, the more you perceive it as valuable, and the more motivated you are to learn about it.
  • When you know more, you also may develop intrinsic interest or pride of craft in the topic.
  • When you confront some external standard for knowledge, and find yourself falling short, cognitive dissonance can kick in. Rather than thinking, “I really ought to up my game a bit,” you think, “algebra is for dorks, and those pointy-headed scientists are just trying to seize power over us working stiffs.”

I’m sure this could be improved on, for example by recognizing that attitudes are a stock.
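As a crude illustration, here’s a toy simulation of the core reinforcing loop (my own formulation with made-up parameters, not a validated model): knowledge creates use, use creates learning, and forgetting erodes the stock. Below a tipping point the loop runs vicious; above it, virtuous.

import numpy as np

def simulate(k0, years=50, dt=0.25):
    """Toy model: dK/dt = learning * use(K) - forgetting * K."""
    learning, forgetting = 0.6, 0.2   # made-up rates, per year
    k = k0
    for _ in np.arange(0, years, dt):
        use = k**2 / (1 + k**2)       # opportunities to use X rise with knowledge of X
        k += dt * (learning * use - forgetting * k)
    return k

for k0 in (0.2, 0.5, 1.5):
    print(f"initial knowledge {k0:.1f} -> after 50 years: {simulate(k0):.2f}")
# With these parameters the unstable tipping point is near 0.4: start below it and
# knowledge decays toward zero (vicious cycle); start above it and knowledge grows
# toward a high equilibrium (virtuous cycle).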

Still, it’s easy to see here how algebra education goes wrong. In school, the red and green loops are weak, because there’s typically no motivating application more compelling than a word problem. Instead, there’s a lot of reliance on external standards (grades and testing), which encourages resistance.

A possible remedy therefore is to drive education with real-world projects, so that algebra emerges as a tool with an obvious need, emphasizing the red and green loops over the blue. An interesting real-world project might be self-examination of the role of the blue loop in our lives.


Bathtub Statistics

The pitfalls of pattern matching don’t just apply to intuitive comparisons of the behavior of associated stocks and flows. They also apply to statistics. This means, for example, that a linear regression like

stock = a + b*flow + c*time + error

is likely to go seriously wrong. That doesn’t stop such things from sneaking into the peer-reviewed literature, though. A more common quasi-statistical error is to take two things that might be related, measure their linear trends, and declare the relationship falsified if the trends don’t match. This bogus reasoning remains a popular pastime of climate skeptics, who ask, how could temperature go down during some period when emissions went up? (See this example.) This kind of naive statistical reasoning, with static mental models of dynamic phenomena, is hardly limited to climate skeptics though.

Given the dynamics, it’s actually quite easy to see how such things can occur. Here’s a more complete example of a realistic situation:

At the core, we have the same flow driving a stock. The flow is determined by a variety of test inputs, so we’re still not worrying about circular causality between the stock and flow. There is potentially feedback from the stock to an outflow, though this is not active by default. The stock is also subject to other random influences, with a standard deviation given by Driving Noise SD. We can’t necessarily observe the stock and flow directly; our observations are subject to measurement error. For purposes that will become evident momentarily, we might perform some simple manipulations of our measurements, like lagging and differencing. We can also measure trends of the stock and flow. Note that this still simplifies reality a bit, in that the flow measurement is instantaneous, rather than requiring its own integration process as physics demands. There are no complications like missing data or unequal measurement intervals.

Now for an experiment. First, suppose that the flow is random (pink noise) and there are no measurement errors, driving noise, or outflows. In that case, you see this:

One could actually draw some superstitious conclusions about the stock and flow time series above by breaking them into apparent episodes, but that’s quite likely to mislead unless you’re thinking explicitly about the bathtub. Looking at a stock-flow scatter plot, it appears that there is no relationship:

Of course, we know this is wrong, because we built the model with perfect Flow->Stock causality. The usual statistical trick to reveal the relationship is to undo the integration by taking the first difference of the stock data. When you do that, plotting the change in the stock vs. the flow (lagged one period to account for the differencing), the relationship reappears:
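For those who want to reproduce the gist, here’s a minimal sketch of that experiment (my own reimplementation with illustrative parameters, not the model shown above): a pink-noise flow, pure accumulation, no measurement error.

import numpy as np

rng = np.random.default_rng(1)
n = 400

# Pink(ish) noise flow: white noise smoothed by a first-order filter (illustrative).
white = rng.normal(0, 1, n)
flow = np.zeros(n)
for t in range(1, n):
    flow[t] = flow[t - 1] + (white[t] - flow[t - 1]) / 5.0   # ~5-period correlation time

# The stock integrates the flow: perfect Flow -> Stock causality, nothing else.
stock = np.cumsum(flow)

# Naive pattern matching: correlate the stock level with the flow.
# The result is weak and essentially arbitrary -- it depends on the realization.
print("corr(stock, flow):      ", round(np.corrcoef(stock, flow)[0, 1], 2))

# Undo the integration: the first difference of the stock against the flow
# (aligned to account for the differencing) recovers the true relationship.
print("corr(diff(stock), flow):", round(np.corrcoef(np.diff(stock), flow[1:])[0, 1], 2))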

Bathtub Dynamics

Failure to account for bathtub dynamics is a basic misperception of system structure that occurs even in simple systems that lack feedback. Research shows that pattern matching, a common heuristic, leads even highly educated people to draw incorrect conclusions about systems as simple as the entry and exit of people in a store.

This can occur in any stock-flow system, which means that it’s ubiquitous. Here’s the basic setup:

Replace “Flow” and “Stock” with your favorite concepts – income and bank balance, sales rate and installed base, births and rabbits, etc. Obviously the flow causes the stock – by definition, the flow rate is the rate of change of the stock level. There is no feedback here; just pure integration, i.e. the stock accumulates the flow.

The pattern matching heuristic attempts to detect causality, or make predictions about the future, by matching the temporal patterns of cause and effect. So, naively, a pattern matcher expects to see a step in the stock in response to a step in the flow. But that’s not what happens:

Pattern matching fails because we shouldn’t expect the patterns to match through an integration. Above, the integral of the step (flow = constant) is a ramp (stock = constant * time). Other patterns are possible. For example, a monotonically decreasing cause (flow) can yield an increasing effect (stock), or even nonmonotonic behavior if it crosses zero:
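A quick numerical sketch of both cases (pure accumulation, with inputs I made up for illustration):

import numpy as np

dt = 0.1
t = np.arange(0, 10, dt)

# Case 1: a step in the flow produces a ramp in the stock, not a step.
step_flow = np.where(t >= 2, 1.0, 0.0)
stock1 = np.cumsum(step_flow) * dt

# Case 2: a monotonically decreasing flow that crosses zero produces a stock
# that rises, peaks when the flow hits zero, then falls -- nonmonotonic behavior.
declining_flow = 1.0 - 0.2 * t
stock2 = np.cumsum(declining_flow) * dt

print("stock1 at the end:", round(stock1[-1], 1), "(ramped up steadily after the step)")
print("stock2 peaks near t =", round(t[np.argmax(stock2)], 1), "(where the flow crosses zero)")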

Unskeptical skepticism

Atmospheric CO2 doesn’t drive temperature, and temperature doesn’t drive CO2. They drive each other, in a feedback loop. Each relationship involves integration: the CO2 concentration accumulates temperature-driven carbon fluxes (through mechanisms like forest growth and ocean uptake), and temperature is the accumulation of heat flux controlled by the radiative effects of CO2.

This has been obvious for decades, yet it still eludes many. A favorite counter-argument against an influence of CO2 on temperature has long been the observation that temperature appears to lead CO2 at turning points in the ice core record. Naively, this violates the requirement for establishing causality, that cause must precede effect. But climate is not a simple system with binary states and events, discrete time, and single causes. In a feedback system, the fact that X lags Y by some discernible amount doesn’t rule out an influence of X on Y; in fact such bidirectional causality is essential for simple oscillators.
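Here’s a toy illustration of that point (emphatically not a climate model; every parameter is invented): two coupled stocks that drive each other, with the external trigger acting only on the first. The second lags the first at every turning point, yet its feedback on the first is real and substantial.

import numpy as np

dt, n = 0.1, 2000
time = np.arange(n) * dt
trigger = np.sin(2 * np.pi * time / 50.0)     # slow external forcing, acts on x only

x = np.zeros(n)   # think "temperature"
y = np.zeros(n)   # think "CO2"
for i in range(1, n):
    dx = (trigger[i - 1] + 0.3 * y[i - 1] - 0.5 * x[i - 1]) * dt   # y amplifies x
    dy = (x[i - 1] - y[i - 1]) / 5.0 * dt                          # y follows x with a lag
    x[i] = x[i - 1] + dx
    y[i] = y[i - 1] + dy

# At what lag is y best aligned with x?
lags = np.arange(-200, 201)
cc = [np.corrcoef(x[200:-200], y[200 + k: n - 200 + k])[0, 1] for k in lags]
print("y lags x by about", round(lags[int(np.argmax(cc))] * dt, 1), "time units,")
print("even though y feeds back on x (the 0.3 term) throughout.")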

A newish paper by Shakun et al. sheds some light on the issue of ice age turning points. It turns out that much of the issue is a matter of data – that ice core records are not representative of global temperatures. But it still appears that CO2 is not the triggering mechanism for deglaciation. The authors speculate that the trigger is northern hemisphere temperatures, presumably driven by orbital insolation changes, followed by changes in ocean circulation. Then CO2 kicks in as amplifier. Simulation backs this up, though it appears to me from figure 3 that models capture the qualitative dynamics, but underpredict the total variance in temperature over the period. To me, this is an interesting step toward a more complete understanding of ice age terminations, but I’ll wait for a few more papers before accepting declarations of victory on the topic.

Predictably, climate skeptics hate this paper. For example, consider Master Tricksed Us! at WattsUpWithThat. Commenters positively drool over the implication that Shakun et al. “hid the incline” by declining to show the last 6000 years of the proxy temperature/CO2 relationship.

I leave the readers to consider the fact that for most of the Holocene, eight millennia or so, half a dozen different ice core records say that CO2 levels were rising pretty fast by geological standards … and despite that, the temperatures have been dropping over the last eight millennia …

But not so fast. First, there’s no skepticism about the data. Perhaps Shakun et al. omitted the last 6k years for a good reason, like homogeneity. A spot check indicates that there might be issues – series MD95-2037 ends in the year 6838 BP, for example. So, perhaps the WUWT graph merely shows spatial selection bias in the dataset. Second, the implication that rising CO2 and falling temperatures somehow disproves a CO2->temperature link is yet another failure to appreciate bathtub dynamics and multivariate causality.

This credulous fawning over the slightest hint of a crack in mainstream theory strikes me as the opposite of skepticism. The essence of a skeptical attitude, I think, is to avoid early lock-in to any one pet theory or data stream. Winning theories emerge from testing lots of theories against lots of constraints. That requires continual questioning of models and data, but also questioning of the questions. Objections that violate physics like accumulation, or heaps of mutually exclusive objections, have to be discarded like any other failed theory. The process should involve more than fools asking more questions than a wise man can answer. At the end of the day, “no theory is possible” is itself a theory that implies null predictions that can be falsified like any other, if it’s been stated explicitly enough.

Burt Rutan's climate causality confusion

I’ve always thought Burt Rutan was pretty cool, so I was rather disappointed when he signed on to a shady climate op-ed in the WSJ (along with Scott Armstrong). I was curious what Rutan’s mental model was, so I googled and found his arguments summarized in an extensive slide deck, available here.

It would probably take me 98 posts to detail the problems with these 98 slides, so I’ll just summarize a few that are particularly noteworthy from the perspective of learning about complex systems.

Data Quality

Rutan claims to be motivated by data fraud,

In my background of 46 years in aerospace flight testing and design I have seen many examples of data presentation fraud. That is what prompted my interest in seeing how the scientists have processed the climate data, presented it and promoted their theories to policy makers and the media. (here)

This is ironic, because he credulously relies on much fraudulent data. For example, slide 18 attempts to show that CO2 concentrations were actually much higher in the 19th century. But that’s bogus, because many of those measurements were from urban areas or otherwise subject to large measurement errors and bias. You can reject many of the data points on first principles, because they imply physically impossible carbon fluxes (500 billion tons in one year).
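The arithmetic behind that first-principles check is simple. A rough sketch, using the standard conversion of roughly 2.13 GtC per ppm of atmospheric CO2; the ppm swing below is illustrative of the excursions those early chemical measurements imply:

# Back-of-the-envelope: what carbon flux does a claimed CO2 swing imply?
GTC_PER_PPM = 2.13        # ~2.13 billion tonnes of carbon per ppm of atmospheric CO2
CO2_PER_C = 44.0 / 12.0   # mass of CO2 per mass of carbon

delta_ppm = 100           # illustrative one-year swing in the 19th-century chemical data
flux_gtc = delta_ppm * GTC_PER_PPM
print(f"{delta_ppm} ppm in a year implies ~{flux_gtc:.0f} GtC (~{flux_gtc * CO2_PER_C:.0f} Gt CO2)")
# For comparison, recent fossil fuel emissions are on the order of 10 GtC/yr, so
# excursions like this would require implausibly huge, unexplained sources and sinks.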

Slides 32-34 also present some rather grossly distorted comparisons of data and projections, complete with attributions of temperature cycles that appear to bear no relationship to the data (Slide 33, right figure, red line).

Slides 50+ discuss the urban heat island effect and the surfacestations.org effort. Somehow they neglect to mention that the outcome of all of that was a cool bias in the data, not a warm bias.

Bathtub Dynamics

Slides 27 and 28 seek a correlation between the CO2 and temperature time series. Failure is considered evidence that temperature is not significantly influenced by CO2. But this is a basic failure to appreciate bathtub dynamics. Temperature is an indicator of the accumulation of heat. Heat integrates radiative flux, which depends on GHG concentrations. So, even in a perfect system where CO2 is the only causal influence on temperature, we would not expect to see matching temporal trends in emissions, concentrations, and temperatures. How do you escape engineering school and design airplanes without knowing about integration?
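To make the integration point concrete, here’s a minimal sketch with illustrative (not calibrated) parameters: freeze the CO2 concentration after some year, and temperature keeps rising anyway, because it responds to the accumulated radiative imbalance rather than to the instantaneous trend in CO2.

import numpy as np

years = np.arange(1900, 2001)
co2 = np.where(years < 1970,
               290.0 * 1.004 ** (years - 1900),   # smooth illustrative rise
               290.0 * 1.004 ** 70)               # concentration frozen after 1970

forcing = 5.35 * np.log(co2 / co2[0])             # standard logarithmic CO2 forcing, W/m2
heat_capacity, restoring = 8.0, 1.3               # illustrative: W-yr/m2/K and W/m2/K

temp = np.zeros(len(years))
for i in range(1, len(years)):
    # Temperature accumulates the radiative imbalance (forcing minus restoring flux).
    temp[i] = temp[i - 1] + (forcing[i - 1] - restoring * temp[i - 1]) / heat_capacity

print("CO2 change, 1970-2000:        ", round(co2[-1] - co2[70], 1), "ppm")
print("Temperature change, 1970-2000:", round(temp[-1] - temp[70], 2), "K (still warming)")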

Correlation and causation

Slide 28 also engages in the fallacy of the single cause and denying the antecedent. It proposes that, because warming rates were roughly the same from 1915-1945 and 1970-2000, while CO2 concentrations varied, CO2 cannot be the cause of the observations. This of course presumes (falsely) that CO2 is the only influence on temperatures, neglecting volcanoes, endogenous natural variability, etc., not to mention blatantly cherry-picking arbitrary intervals.

Slide 14 shows another misattribution of single cause, comparing CO2 and temperature over 600 million years, ignoring little things like changes in the configuration of the continents and output of the sun over that long period.

In spite of the fact that Rutan generally argues against correlation as evidence for causation, Slide 46 presents correlations between orbital factors and sunspots (the latter smoothed in some arbitrary way) as evidence that these factors do drive temperature.

Feedback

Slide 29 shows temperature leading CO2 in ice core records, concluding that temperature must drive CO2, and not the reverse. In reality, temperature and CO2 drive one another in a feedback loop. That turning points in temperature sometimes lead turning points in CO2 does not preclude CO2 from acting as an amplifier of temperature changes. (Recently there has been a little progress on this point.)

Too small to matter

Slide 12 indicates that CO2 concentrations are too small to make a difference, which has no physical basis, other than the general misconception that small numbers don’t matter.

Computer models are not evidence

So Rutan claims on slide 47. Of course this is true in a trivial sense, because one can always build arbitrary models that bear no relationship to anything.

But why single out computer models? Mental models and pencil-and-paper calculations are not uniquely privileged. They are just as likely to fail to conform to data, laws of physics, and rules of logic as a computer model. In fact, because they’re not stated formally, testable automatically, or easily shared and critiqued, they’re more likely to contain some flaws, particularly mathematical ones. The more complex a problem becomes, the more the balance tips in favor of formal (computer) models, particularly in non-experimental sciences where trial-and-error is not practical.

There’s also no such thing as model-free inference. Rutan presents many of his charts as if the data speak for themselves. In fact, no measurements can be taken without a model of the underlying process to be measured (in a thermometer, the thermal expansion of a fluid). More importantly, even the simplest trend calculation or comparison of time series implies a model. Leaving that model unstated just makes it easier to engage in bathtub fallacies and other errors in reasoning.

The bottom line

The problem here is that Rutan has no computer model. So, he feels free to assemble a dog’s breakfast of data, sourced from illustrious scientific institutions like the Heritage Foundation (slide 12), and call it evidence. Because he skips the exercise of trying to put everything into a rigorous formal feedback model, he’s received no warning signs that he has strayed far from reality.

I find this all rather disheartening. Clearly it is easy for a smart, technical person to be wildly incompetent outside their original field of expertise. But worse, it’s even easy for them to assemble a portfolio of pseudoscience that looks like evidence, capitalizing on past achievements to sucker a loyal following.

Strange times for Europe's aviation carbon tax

The whole global climate negotiation process is a bit of a sideshow, in that negotiators don’t have the freedom to actually agree to anything meaningful. When they head to Poznan, or Copenhagen, or Durban, they get their briefings from finance and economic ministries, not environment ministries. The mandates are evidently that there’s no way most countries will agree to anything like the significant emissions cuts needed to achieve stabilization.

That’s particularly clear at the moment, with Europe imposing a carbon fee on flights using its airspace, and facing broad opposition. And what opponent makes the biggest headlines? India’s environment minister – possibly the person on the planet who should be happiest to see any kind of meaningful emissions policy anywhere.

Clearly, climate is not driving the bus.