## Integration & Correlation

Claims by AI chatbots, engineers and Nobel prize winners notwithstanding, absence of correlation does not prove absence of causation, any more than presence of correlation proves presence of causation. Bard outlines several reasons from noise and nonlinearity, but missed a key one: bathtub statistics.

Here’s a really simple example of how this reasoning can go wrong. Consider a system with a stock Y(t) that integrates a flow X(t):

X(t) = -t

Y(t) = ∫X(t)dt

We don’t need to simulate to solve for Y(t) = -1/2*t^2 +C.

Over the interval t=[-1,1] the X and Y time series look like this:

The X-Y relationship is parabolic, with correlation zero:

Zero correlation can’t mean “not causal” because we constructed the system to be causal. Even worse, the sign of the relationship depends on the subset of the interval you examine:

This is not the only puzzling case. Consider instead:

X(t) = 1

Y(t) = ∫X(t)dt = t + C

In this case, X(t) has zero variance. But Corr(X,Y) = Cov(X,Y)/σ(X)σ(Y) which is 0/0. What are we to make of that?

This pathology can also arise from feedback. Consider a thermostat that controls a heater that operates in two states (on or off). If the heater is fast, and the thermostat is sensitive with a narrow temperature band, then σ(temperature) will be near 0, even though the heater is cycling with σ(heater state)>0.

## Thyroid Dynamics: Noise

A couple weeks ago I wrote about the perceptual challenges of managing thyroid stimulating hormone (TSH), which has an exponential response to the circulating thyroid hormones (T3 & T4) you’d actually like to control.

Another facet of the thyroid control problem is noise. Generally, uncertainty in measurements is not made available to users. For example, the lab results reported by MyChart have no confidence bounds: If you start looking for information on these tests, you’ll usually find precision estimates that sound pretty good – typically 5 to 7% error. (Example.) However, this understates the severity of the problem.

It’s well known that individual variation in the TSH<->T3,T4 setpoint is large, and the ATA guidelines mention this, if you read the detailed discussion. However, this is presented as a reason for the superiority of TSH measurements, “The logarithmic relationship between TSH and thyroid hormone bestows sensitivity: even if circulating T3 and T4 are in the normal range, it cannot be assumed that the subject is euthyroid. The interindividual ranges for T3 and T4 are much broader than the individual variance (), such that measuring T3 and T4 is a suboptimal way to assess thyroid status.” The control implications of variation over time within an individual are not mentioned.

The issue we face in our N=1 sample is unexplained longitudinal variation around the setpoint. In our data, this is HUGE. At a given dose, even during a long period of stability, variation in TSH is not 10%; it’s a factor of 10.

Now consider the problem facing a doc trying to titrate your dose in a 30-minute visit. They tested your TSH, and it’s 4, or .4, right at the high or low end o the recommended range. Should they adjust the dose? (The doc’s problem is actually harder than the data presented above suggests, because they never see this much data – changes in providers, labs and systems truncate the available information to just a few points.) In our experience, 3 out of 5 doctors do change the dose, even though the confidence bounds on these measurements are probably big enough to sail the Exxon Valdez through.

Individuals exhibit fluctuations in the concentration of serum thyroid-stimulating hormone (TSH) over time. The scale of these variations ranges from minutes to hours, and from months to years. The main factors contributing to the observed within-person fluctuations in serum TSH comprise pulsatile secretion, circadian rhythm, seasonality, and ageing.

I think the right response is actually the byline of this blog: don’t just do something, stand there! If one measurement potentially has enormous variation, the first thing you should probably do is leave the dose alone and retest after a modest time. On several occasions, we have literally begged for such a retest, and been denied.

The consequence of test aversion is that we have only 20 data points over 8 years, and almost none in close proximity to one another. That makes it impossible to determine whether the variation we’re seeing is measurement error (blood draw or lab methods), fast driving noise (circadian effects), or slow trends (e.g., seasonal). I’ve been fitting models to the data for several years, but this sparsity and uncertainty gives the model fits. Here’s an example:

At the highlighted point (and half a dozen others), the model finds the data completely inexplicable. The Kalman filter moves the model dramatically towards the data (the downward spike in the red curve), but only about halfway, because the estimate yields both high measurement error and high driving noise in TSH. Because the next measurement doesn’t occur for 4 months, there’s no way to sort out which is which.

This extreme noise, plus nonlinearity previously mentioned, is really a perfect setup for errors in dose management. I’ll describe one or two in a future post.

## Climate Causality Confusion

A newish set of papers (1. Theory (preprint); 2. Applications (preprint); 3. Extension) is making the rounds on the climate skeptic sites, with – ironically – little skepticism applied.

… According to the commonly assumed causality link, increased [CO2] causes a rise in T. However, recent developments cast doubts on this assumption by showing that this relationship is of the hen-or-egg type, or even unidirectional but opposite in direction to the commonly assumed one. These developments include an advanced theoretical framework for testing causality based on the stochastic evaluation of a potentially causal link between two processes via the notion of the impulse response function. …. All evidence resulting from the analyses suggests a unidirectional, potentially causal link with T as the cause and [CO2] as the effect.

Galileo complex seeps in when the authors claim that absence of correlation or impulse response from CO2 -> temperature proves absence of causality:

Clearly, the results […] suggest a (mono-directional) potentially causal system with T as the cause and [CO2] as the effect. Hence the common perception that increasing [CO2] causes increased T can be excluded as it violates the necessary condition for this causality direction.

Unfortunately, these claims are bogus. Here’s why.

The authors estimate impulse response functions between CO2 and temperature (and back), using the following formalism:

where g(h) is the response at lag h. As the authors point out, if

the IRF is zero for every lag except for the specific lag 0, then Equation (1) becomes y(t)=bx(t-h0) +v(t). This special case is equivalent to simply correlating  y(t) with x(t-h0) at any time instance . It is easy to find (cf. linear regression) that in this case the multiplicative constant is the correlation coefficient of y(t) and  x(t-h0) multiplied by the ratio of the standard deviations of the two processes.

Now … anyone who claims to have an “advanced theoretical framework for testing causality” should be aware of the limitations of linear regression. There are several possible issues that might lead to misleading conclusions about causality.

Problem #1 here is bathtub statistics. Temperature integrates the radiative forcing from CO2 (and other things). This is not debatable – it’s physics. It’s old physics, and it’s experimental, not observational. If you question the existence of the effect, you’re basically questioning everything back to the Enlightenment. The implication is that no correlation is expected between CO2 and temperature, because integration breaks pattern matching. The authors purport to avoid integration by using first differences of temperature and CO2. But differencing both sides of the equation doesn’t solve the integration problem; it just kicks the can down the road. If y integrates x, then patterns of the integrals or derivatives of y and x won’t match either. Even worse differencing filters out the signals of interest.

Problem #2 is that the model above assumes only equation error (the term v(t) on the right hand side). In most situations, especially dynamic systems, both the “independent” (a misnomer) and dependent variables are subject to measurement error, and this dilutes the correlation or slope of the regression line (aka attenuation bias), and therefore also the IRF in the authors’ framework. In the case of temperature, the problem is particularly acute, because temperature also integrates internal variability of the climate system (weather) and some of this variability is autocorrelated on long time scales (because for example oceans have long time constants). That means the effective number of data points is a lot less than the 60 years or 720 months you’d expect from simple counting.

Dynamic variables are subject to other pathologies, generally under the heading of endogeneity bias, and related features with similar effects like omitted variable bias. Generalizing the approach to distributed lags in no way mitigates these. The bottom line is that absence of correlation doesn’t prove absence of causation.

Admittedly, even Nobel Prize winners can screw up claims about causality and correlation and estimate dynamic models with inappropriate methods. But causality confusion isn’t really a good way to get into that rarefied company.

I think methods purporting to assess causality exclusively from data are treacherous in general. The authors’ proposed method is provably wrong in some cases, including this one, as is Granger Causality. Even if you have pretty good assumptions, you’ll always find a system that violates them. That’s why it’s so important to take data-driven results with a grain of salt, and look for experimental control (where you can get it) and mechanistic explanations.

One way to tell if you’ve gotten causality wrong is when you “discover” mechanisms that are physically absurd. That happens on a spectacular scale in the third paper:

… we find Δ=23.5 and 8.1 Gt C/year, respectively, i.e., a total global increase in the respiration rate of Δ=31.6 Gt C/year. This rate, which is a result of natural processes, is 3.4 times greater than the CO2 emission by fossil fuel combustion (9.4 Gt C /year including cement production).

To put that in perspective, the authors propose a respiration flow that would put the biosphere about 30% out of balance. This implies a mass flow of trees harvested, soils destroyed, etc. 3.4 times as large as the planetary flow of fossil fuels. That would be about 4 cubic kilometers of wood, for example. In the face of the massive outflow from the biosphere, the 9.4 GtC/yr from fossil fuels went where, exactly? Extraordinary claims require extraordinary evidence, but the authors apparently haven’t pondered how these massive novel flows could be squared with other lines of evidence, like C isotopes, ocean Ph, satellite CO2, and direct estimates of land use emissions.

This “insight” is used to construct a model of the temperature->CO2 process:

In this model, the trend in CO2 is explained almost exclusively by the mean temperature effect mu_v = alpha*(T-T0). That effect is entirely ad hoc, with no basis in the impulse response framework.

How do we get into this pickle? I think the simple answer is that the authors’ specification of the system is incomplete. As above, they define a causal system,

y(t) = ∫g1(h)x(t-h)dh

x(t) = ∫g2(h)y(t-h)dh

where g(.) is an impulse response function weighting lags h and the integral is over h from 0 to infinity (because only nonnegative lags are causal). In their implementation, x and y are first differences, so in their climate example, Δlog(CO2) and ΔTemp. In the estimation of the impulse lag structures g(.), the authors impose nonnegativity and (optionally) smoothness constraints.

A more complete specification is roughly:

Y = A*X + U

dX/dt = B*X + E

where

• X is a vector of system states (e.g., CO2 and temperature)
• Y is a vector of measurements (observed CO2 and temperature)
• A and B are matrices of coefficients (this is a linear view of the system, but could easily be generalized to nonlinear functions)
• E is driving noise perturbing the state, and therefore integrated into it
• U is measurement error

My notation could be improved to consider covariance and state-dependent noise, though it’s not really necessary here. Fred Schweppe wrote all this out decades ago in Uncertain Dynamic Systems, and you can now find it in many texts like Stengel’s Optimal Control and Estimation. Dixit and Pindyck transplanted it to economics and David Peterson brought it to SD where it found its way into Vensim as the combination of Kalman filtering and optimization.

How does this avoid the pitfalls of the Koutsoyiannis et al. approach?

• An element of X can integrate any other element of X, including itself.
• There are no arbitrary restrictions (like nonnegativity) on the impulse response function.
• The system model (A, B, and any nonlinear elements augmenting the framework) can incorporate a priori structural knowledge (e.g., physics).
• Driving noise and measurement error are recognized and can be estimated along with everything else.

Does the difference matter? I’ll leave that for a second post with some examples.

## Correlation & causation – it’s complicated

It’s common to hear that correlation does not imply causation. It’s certainly true in the strong sense that observing a correlation between X and Y does not prove causation X->Y, because the true causality might be Y->X or Z->(X,Y).

Some wag (Feynman?) pointed out that “correlation does not imply causation, but it’s a good start.” This is also true. If you’re trying to understand the causes of Y, data mining for things that are correlated is a useful exploratory step, even if it proves nothing. If you find something, then you can look for plausible mechanisms, try experiments, etc.

Some go a little further than this, and combine Popper’s falsification with causality criteria to argue that lack of correlation does imply lack of causation. Unfortunately, this is untrue, for a number of reasons:

1. Measurement error – in OLS regression, the slope is just the correlation coefficient normalized by standard deviations. However, if there’s measurement error in the RHS variables, not just equation error affecting the LHS, the slope is affected by attenuation bias. In other words, a poor signal to noise ratio destroys apparent correlation, even when causality is present.
2. Integration – bathtub dynamics renders pattern matching incorrect, and destroys correlations, even in synthetic data experiments where causation is known to exist.
3. Nonlinearity – there are many possible bivariate patterns that result in a linear correlation coefficient of 0 despite an obvious (possibly causal) relationship.

Most systems have all three of these features to some extent, and they gain strength in combination. Noise integrates into the system stocks, and the slope or correlation of a relationship may reverse, depending on system state. Sugihara et al. show that Granger Causality fails, because “in deterministic dynamic systems (even noisy ones), if X is a cause for Y, information about X will be redundantly present in Y itself and cannot formally be removed….”

The common thread here is that no method can say much about causality if the assumptions neglect features of the system dynamics (integration or nonlinearity) or stochastic processes (measurement error and driving noise). Sometimes you get lucky, because you have a natural experiment, or high precision measurements, or simply loads of data about benign dynamics, but luck rarely coincides with big novel problems. Presence or absence of correlation is suggestive but far from definitive.

## Election Fraud and Benford’s Law

Statistical tests only make sense when the assumed distribution matches the data-generating process.

There are several analyses going around that purport to prove election fraud in PA, because the first digits of vote counts don’t conform to Benford’s Law. Here’s the problem: first digits of vote counts aren’t expected to conform to Benford’s Law. So, you might just as well say that election fraud is proved by Newton’s 3rd Law or Godwin’s Law.

Benford’s Law describes the distribution of first digits when the set of numbers evaluated derives from a scale-free or Power Law distribution spanning multiple orders of magnitude. Lots of processes generate numbers like this, including Fibonacci numbers and things that grow exponentially. Social networks and evolutionary processes generate Zipf’s Law, which is Benford-conformant.

Here’s the problem: vote counts may not have this property. Voting district sizes tend to be similar and truncated above (dividing a jurisdiction into equal chunks), and vote proportions tend to be similar due to gerrymandering and other feedback processes. This means the Benford’s Law assumptions are violated, especially for the first digit.

This doesn’t mean the analysis can’t be salvaged. As a check, look at other elections for the same region. Check the confidence bounds on the test, rather than simply plotting the sample against expectations. Examine the 2nd or 3rd digits to minimize truncation bias. Best of all, throw out Benford and directly simulate a distribution of digits based on assumptions that apply to the specific situation. If what you’re reading hasn’t done these things, it’s probably rubbish.

This is really no different from any other data analysis problem. A statistical test is meaningless, unless the assumptions of the test match the phenomena to be tested. You can’t look at lightning strikes the same way you look at coin tosses. You can’t use ANOVA when the samples are non-Normal, or have unequal variances, because it assumes Normality and equivariance. You can’t make a linear fit to a curve, and you can’t ignore dynamics. (Well, you can actually do whatever you want, but don’t propose that the results mean anything.)