Tom – Page 4

More reasons to love emissions pricing

I was flipping through a recent Tech Review, and it seemed like every other article was an unwitting argument for emissions pricing. Two examples:

Job title of the future: carbon accountant

We need carbon engineers who know how to make emissions go away more than we need bean counters to tally them. Are we also going to have nitrogen accountants, and PFAS accountants, and embodied methane in iridium accountants, and … ? That way lies insanity.

The fact is, if carbon had a nontrivial price attached at the wellhead, it would pervade the economy, and we’d already have carbon accountants. They’re called accountants.

More importantly, behind those accountants is an entire infrastructure of payment systems that enforces conservation of money. You can only cheat an accounting system for so long, before the cash runs out. We can’t possibly construct parallel systems providing the same robustness for every externality we’re interested in.

Here’s what we know about lab-grown meat and climate change

Realistically, now matter how hard we try to work out the relative emissions of natural and laboratory cows, the confidence bounds on the answer will remain wide until the technology is used at scale.

We can’t guide that scaling process by assessments that are already out of date when they’re published. Lab meat innovators need a landscape in which carbon is priced into their inputs, so they can make the right choices along the way.

Model quality: draining the swamp for large models

In my high road post a while ago, I advocated “voluntary simplicity” as a way of avoiding a large model with an insurmountable burden of undiscovered rework.

Sometimes this is not a choice, because you’re asked to repurpose a large model for some related question. Maybe you didn’t build it, or maybe you did, but it’s time to pause and reflect before proceeding. It’s critical to determine where on the spectrum above the model lies – between the Vortex of Confusion and Nirvana.

I will assert from experience that a large model was almost always built with some narrow questions in mind, and that it was exercised mainly over parts of its state space relevant to those questions. It’s quite likely that a lot of bad behaviors lurk in other parts of the state space. When you repurpose the model those things are going to bite you.

If you start down the red road, “we’ll just add a little stuff for the new question…” you will find yourself in a world of hurt later. It’s absolutely essential that you first do some rigorous testing to establish what the limitations of the model might be in the new context.

The reason scope is so dangerous is that its effect on your ability to make quality progress is nonlinear. The number of interactions you have to manage, and therefore opportunities for errors, and complexity of corrections, is proportional to scope squared. The speed of your model declines with 1/scope, as does the time you have to pay attention to each variable.

My tentative recipe for success, or at least survival:

1. Start early. Errors beget more errors, so the sooner you discover them, the sooner you can arrest that vicious cycle.

2. Be ruthless. Don’t test to see if the model can answer the new question; test to see if you can break it and get nonsense answers.

3. Use your tools. Pay attention to unit errors and runtime warnings. Write Reality Checks to automate tests. Set ranges on key variables to ensure that they’re within reason.

4. Isolate. Because of the nonlinear interaction problem, it’s hard to interpret tests on a full model. Instead, extract components and test them in isolation. You can do this by copy-pasting, or even easier in Vensim, by using Synthesim Overrides to modify inputs to steps, ramps, etc.

5. Don’t let go. When you find a problem, track it back to its root cause.

6. Document. Keep a lab notebook, or an email stream, or a todo list, so you don’t lose the thought when you have multiple issues to chase.

7. Be extreme. Pick a stock and kick it with a pulse or an override. Take all the people out of the factory, or all the ships out of the fleet. What happens? Does anything go negative? Do decisions remain consistent with goals?

8. Calibrate. Calibration against data can be a useful way to find issues, but it’s much slower than other options, so this is something to pursue late in the process. Also, remember that model-data gaps are equally likely to reveal a problem with the data.

9. Rebuild. If you’re finding a lot of problems, you may be better of starting clean, using the existing model as a conceptual guide, but reconsidering the detailed design of the implementation.

10. Submodel. It’s often hard to fix something inside the full plate of spaghetti. But you may be able to identify a solution in an external submodel, free of distractions, and then transplant it back into the host.

11. Reduce. If you can’t rebuild the full scope within available resources, cut things out. This may not be appetizing to the client, but it’s certainly better than delivering a fragile model that only works if you don’t touch it.

12. If you find you’re in a hole, stop digging. Don’t add any features until you have things under control, because they’ll only exacerbate the problems.

13. Communicate. Let the client, and your team, know what you’re up to, and why quality is more important than their cherished kitchen sink.

Integration & Correlation

Claims by AI chatbots, engineers and Nobel prize winners notwithstanding, absence of correlation does not prove absence of causation, any more than presence of correlation proves presence of causation. Bard outlines several reasons from noise and nonlinearity, but missed a key one: bathtub statistics.

Here’s a really simple example of how this reasoning can go wrong. Consider a system with a stock Y(t) that integrates a flow X(t):

X(t) = -t

Y(t) = ∫X(t)dt

We don’t need to simulate to solve for Y(t) = -1/2*t^2 +C.

Over the interval t=[-1,1] the X and Y time series look like this:

The X-Y relationship is parabolic, with correlation zero:

Zero correlation can’t mean “not causal” because we constructed the system to be causal. Even worse, the sign of the relationship depends on the subset of the interval you examine:

This is not the only puzzling case. Consider instead:

X(t) = 1

Y(t) = ∫X(t)dt = t + C

In this case, X(t) has zero variance. But Corr(X,Y) = Cov(X,Y)/σ(X)σ(Y) which is 0/0. What are we to make of that?

This pathology can also arise from feedback. Consider a thermostat that controls a heater that operates in two states (on or off). If the heater is fast, and the thermostat is sensitive with a narrow temperature band, then σ(temperature) will be near 0, even though the heater is cycling with σ(heater state)>0.

AI Chatbots on Causality

Having recently encountered some major causality train wrecks, I got curious about LLM “understanding” of causality. If AI chatbots are trained on the web corpus, and the web doesn’t “get” causality, there’s no reason to think that AI will make sense either.

TLDR; ChatGPT and Bing utterly fail this test, for reasons that are evident in Google Bard’s surprisingly smart answer.

ChatGPT: FAIL

Bing: FAIL

Google Bard: PASS

Google gets strong marks for mentioning a bunch of reasons to expect that we might not find a correlation, even though x is known to cause y. I’d probably only give it a B+, because it neglected integration and feedback, but it’s a good answer that properly raises lots of doubts about simplistic views of causality.

Thyroid Dynamics: Hyper Resurgence

In my last thyroid post, I described a classic case of overshoot due to failure to account for delays. I forgot to mention the trigger for this episode.

At point B above, there was a low TSH measurement, at .05 well below the recommended floor of .4. That was taken as a signal for a dose reduction, which is qualitatively reasonable.

Let’s suppose we believe the dose-TSH response to be stable:

Then we have some puzzles. First, at 200mcg, we’d expect TSH=.15, about 3x higher. To get the observed measurement, we’d expect the dose to be more like 225mcg. Second, exactly the same reading of .05 has been observed at a much lower dose (162.5mcg), which is in the sweet spot (yellow box) we should be targeting. Third, also within that sweet spot, at 150mcg, we’ve seen TSH as high as 15 – far out of range in the opposite direction.

I think an obvious conclusion is that noise in the system is extreme, so there’s good reason to respond by discounting the measurement and retesting. But that’s not what happens in general. Here’s a plot of TSH observations (x, log scale) against subsequent dose adjustments (y, %):
There are three clusters of note.

The yellow-highlighted points are low TSH values that were followed by large dose reductions, exceeding guidelines.
The green points are large dose increases needed to restore the yellow changes, when they subsequently proved to be errors.
The purple points (3 total) are high TSH readings, right at the top of the recommended range, that did not induce a dose increase, even though they were accompanied by symptom complaints.

This is interesting, because the trendline seems to indicate a reasonable, if noisy, strategy of targeting TSH=1. But the operative decision rule for several of the doctors involved seems to be more like:

If you get a TSH measurement at the high end of the range, indicating a dose increase might be appropriate, ignore it.
If you get a low TSH measurement, PANIC. Cut dose drastically.
If you’re the doctor replacing the one just fired for screwing this up, restore the status quo.

Why is this? I think it’s an error in reasoning. Low TSH could be caused by excessive T4 levels, which could arise from (a) overtreatment of a hypothyroid patient, or (b) hyperthyroid activity in a previously hypothyroid patient. In the case described previously, evidence from T4 testing as well as the long term relationship suggested that the dose was 20-30% high, but it was ultimately reduced by 60%. But in two other cases, there was no T4 confirmation, and the dose was right in the middle of its apparent sweet spot. That rules out overtreatment, so the mental model behind a dose reduction has to be (b). But that makes no sense. It’s a negative feedback system, yet somehow the thyroid has increased its activity, in response to a reduction in the hormone that normally signals it to do so? Admittedly, there are possibilities like cancer that could explain such behavior, but no one has ever explored that possibility in N=1’s case.

I think the basic problem here is that it’s hard to keep a mechanistic model of a complex hormone signalling system in your head, which makes it easy to get fooled by delays, feedback, noise and nonlinearity. Bad information systems and TSH monomania contribute to the problem, as does ignoring dose guidelines due to overconfidence.

So what should happen in response to a low TSH measurement in patient N=1? I think it’s more like the following:

Don’t panic.
- It might be a bad measurement (labs don’t correct for seasonality, time of day, and other features that could inflate variance beyond the precision of the test itself).
- It might be some unknown source of variability driving TSH, like food, medications, or endogenous variation in upstream hormones.
Look at the measurement in context of other information: the past dose-response relationship, T4, and symptoms, and reference dose per unit body mass.
Make at most a small move, wait for the guideline-prescribed period, and retest.

Thyroid Dynamics: Dose Management Challenges

In my last two posts about thyroid dynamics, I described two key features of the information environment that set up a perfect storm for dose management:

The primary indicator of the system state for a hypothyroid patient is TSH, which has a nonlinear (exponential) response to T3 and T4. This means you need to think about TSH on a log scale, but test results are normally presented on a linear scale. Information about the distribution is hard to come by. (I didn’t mention it before, but there’s also an element of the Titanic steering problem, because TSH moves in a direction opposite the dose and T3/T4.)
Measurements of TSH are subject to a rather extreme mix of measurement error and driving noise (probably mostly the latter). Test results are generally presented without any indication of uncertainty, and doctors generally have very few data points to work with.

As if that weren’t enough, the physics of the system is tricky. A change in dose is reflected in T4 and T3, then in TSH, only after a delay. This is a classic “delayed negative feedback loop” situation, much like the EPO-anemia management challenge in the excellent work by Jim Rogers, Ed Gallaher & David Dingli.

If you have a model, like Rogers et al. do, you can make fairly rapid adjustments with confidence. If you don’t, you need to approach the problem like an unfamiliar shower: make small, slow adjustments. If you react two quickly, you’ll excite oscillations. Dose titration guidelines typically reflects this:

Titrate dosage by 12.5 to 25 mcg increments every 4 to 6 weeks, as needed until the patient is euthyroid.

Just how long should you wait before making a move? That’s actually a little hard to work out from the literature. I asked OpenEvidence about this, and the response was typically vague:

The expected time delay between adjusting the thyroid replacement dose and the response of thyroid-stimulating hormone (TSH) is typically around 4 to 6 weeks. This is based on the half-life of levothyroxine (LT4), which reaches steady-state levels by then, and serum TSH, which reaches its nadir at the same time.[1]

The first citation is the ATA guidelines, but when you consult the details, there’s no cited basis for the 4-6 weeks. Presumably this is some kind of 3-tau rule of thumb learned from experience. As an alternative, I tested a dose change in the Eisenberg et al. model:

At the arrow, I double the synthetic T4 dose on a hypothetical person, then observe the TSH trajectory. Normally, you could then estimate the time constant directly from the chart: 70% of the adjustment is realized at 1*tau, 85% at 2*tau, 95% at 3*tau. If you do that here, tau is about 8 days. But not so fast! TSH responds exponentially, so you need to look at this on a log-y scale:

Looking at this correctly, tau is somewhat longer: about 12-13 days. This is still potentially tricky, because the Eisenberg model is not first order. However, it’s reassuring that I get similar time constants when I estimate my own low-order metamodel.

Taking this result at face value, one could roughly say that TSH is 95% equilibrated to a dose change after about 5 weeks, which corresponds pretty well with the ATA guidelines.

This is a long setup for … the big mistake. Referring to the lettered episodes on the chart above, here’s what happened.

A: Dose is constant at about 200mcg (a little hard to be sure, because it was a mix of 2 products, and the equivalents aren’t well established.
B: New doctor orders a test, which comes out very low (.05), out of the recommended range. Given the long term dose-response range, we’d expect about .15 at this dose, so it seems likely that this was a confluence of dose-related factors and noise.
C: New doc orders an immediate drastic reduction of dose by 37.5% or 75mcg (3 to 6 times the ATA recommended adjustment).
D: Day 14 from dose change, retest is still low (.2). At this point you’d expect that TSH is at most 2/3 equilibrated to the new dose. Over extremely vociferous objections, doc orders another 30% reduction to 88mcg.
E: Patient feeling bad, experiencing hair loss and other symptoms. Goes off the reservation and uses remaining 125mcg pills. Coincident test is in range, though one would not expect it to remain so, because the dose changes are not equilibrated.
F: Suffering a variety of hypothyroid symptoms at the lower dose.
G: Retest after an appropriate long interval is far out of range on the high side (TSH near 7). Doc unresponsive.
H: Fired the doc. New doc restores dose to 125mcg immediately.
I: After an appropriate interval, retest puts TSH at 3.4, on the high side of the ATA range and above the NACB guideline. Doc adjusts to 175mcg, in part considering symptoms rather than test results.

This is an absolutely classic case of overshooting a goal in a delayed negative feedback system. There are really two problems here: failure to anticipate the delay, and therefore making a second adjustment before the first was stabilized, and making overly aggressive changes, much larger than guidelines recommend.

So, what’s really going on? I’ve been working with a simplified meta version of the Eisenberg model to figure this out. (The full model is hourly, and therefore impractical to run with Kalman filtering over multi-year horizons. It’s silly to use that much computation on a dozen data points.)

The problem is, the model can’t replicate the data without invoking huge driving noise – there simply isn’t any thing in the structure that can account for data points far from the median behavior. I’ve highlighted a few above. At each of these points, the model takes a huge jump, not because of any known dynamics, but because of a filter reset of the model state. This is a strong hint that there’s an unobserved state influencing the system.

If we could get docs to provide a retest at these outlier points, we could at least rule out measurement error, but that has almost never happened. Also, if docs would routinely order a full panel including T3 and T4, not just TSH, we might have a better mechanistic explanation, but that has also been hard to get. Recently, a doc ordered a full panel, but office staff unilaterally reduced the scope to TSH only, because they felt that testing T3 and T4 was “unconventional”. No doubt this is because ATA and some authors have been shouting that TSH is the only metric needed, and any nuances that arise when the evidence contradicts get lost.

For our N=1, the instability of the TSH/T4 relationship contradicts the conventional wisdom, which is that individuals have a stable set point., with the observed high population variation arising from diversity of set points across individuals:

I think the obvious explanation in our N=1 is that some individuals have an unstable set point. You could visualize that in the figure above as moving from one intersection of curves to another. This could arise from a change in the T4->TSH curve (e.g. something upstream of TSH in the hypothalamic-pituitary-adrenal axis) or the TSH->T4 relationship (intermittent secretion or conversion). Unfortunately very few treatment guidelines recognize this possibility.

Thyroid Dynamics: Chartjunk

I just ran across a funny instance of TSH nonlinearity. Check out the axis on this chart:

It’s actually not as bad as you’d think: the irregular axis is actually a decent approximation of a log-linear scale:

My main gripe is that the perceptual midpoint of the ATA range bar on the chart is roughly 0.9, whereas the true logarithmic midpoint is more like 1.6. The NACB bar is similarly distorted.

Thyroid Dynamics: Noise

A couple weeks ago I wrote about the perceptual challenges of managing thyroid stimulating hormone (TSH), which has an exponential response to the circulating thyroid hormones (T3 & T4) you’d actually like to control.

Another facet of the thyroid control problem is noise. Generally, uncertainty in measurements is not made available to users. For example, the lab results reported by MyChart have no confidence bounds: If you start looking for information on these tests, you’ll usually find precision estimates that sound pretty good – typically 5 to 7% error. (Example.) However, this understates the severity of the problem.

It’s well known that individual variation in the TSH<->T3,T4 setpoint is large, and the ATA guidelines mention this, if you read the detailed discussion. However, this is presented as a reason for the superiority of TSH measurements, “The logarithmic relationship between TSH and thyroid hormone bestows sensitivity: even if circulating T₃ and T₄ are in the normal range, it cannot be assumed that the subject is euthyroid. The interindividual ranges for T₃ and T₄ are much broader than the individual variance (369), such that measuring T₃ and T₄ is a suboptimal way to assess thyroid status.” The control implications of variation over time within an individual are not mentioned.

The issue we face in our N=1 sample is unexplained longitudinal variation around the setpoint. In our data, this is HUGE. At a given dose, even during a long period of stability, variation in TSH is not 10%; it’s a factor of 10.

Now consider the problem facing a doc trying to titrate your dose in a 30-minute visit. They tested your TSH, and it’s 4, or .4, right at the high or low end o the recommended range. Should they adjust the dose? (The doc’s problem is actually harder than the data presented above suggests, because they never see this much data – changes in providers, labs and systems truncate the available information to just a few points.) In our experience, 3 out of 5 doctors do change the dose, even though the confidence bounds on these measurements are probably big enough to sail the Exxon Valdez through.

There is at last a paper that tackles this issue:

Individuals exhibit fluctuations in the concentration of serum thyroid-stimulating hormone (TSH) over time. The scale of these variations ranges from minutes to hours, and from months to years. The main factors contributing to the observed within-person fluctuations in serum TSH comprise pulsatile secretion, circadian rhythm, seasonality, and ageing.

I think the right response is actually the byline of this blog: don’t just do something, stand there! If one measurement potentially has enormous variation, the first thing you should probably do is leave the dose alone and retest after a modest time. On several occasions, we have literally begged for such a retest, and been denied.

The consequence of test aversion is that we have only 20 data points over 8 years, and almost none in close proximity to one another. That makes it impossible to determine whether the variation we’re seeing is measurement error (blood draw or lab methods), fast driving noise (circadian effects), or slow trends (e.g., seasonal). I’ve been fitting models to the data for several years, but this sparsity and uncertainty gives the model fits. Here’s an example:

At the highlighted point (and half a dozen others), the model finds the data completely inexplicable. The Kalman filter moves the model dramatically towards the data (the downward spike in the red curve), but only about halfway, because the estimate yields both high measurement error and high driving noise in TSH. Because the next measurement doesn’t occur for 4 months, there’s no way to sort out which is which.

This extreme noise, plus nonlinearity previously mentioned, is really a perfect setup for errors in dose management. I’ll describe one or two in a future post.

Climate Causality Confusion

A newish set of papers (1. Theory (preprint); 2. Applications (preprint); 3. Extension) is making the rounds on the climate skeptic sites, with – ironically – little skepticism applied.

The claim is bold:

… According to the commonly assumed causality link, increased [CO₂] causes a rise in T. However, recent developments cast doubts on this assumption by showing that this relationship is of the hen-or-egg type, or even unidirectional but opposite in direction to the commonly assumed one. These developments include an advanced theoretical framework for testing causality based on the stochastic evaluation of a potentially causal link between two processes via the notion of the impulse response function. …. All evidence resulting from the analyses suggests a unidirectional, potentially causal link with T as the cause and [CO₂] as the effect.

Galileo complex seeps in when the authors claim that absence of correlation or impulse response from CO2 -> temperature proves absence of causality:

Clearly, the results […] suggest a (mono-directional) potentially causal system with T as the cause and [CO₂] as the effect. Hence the common perception that increasing [CO₂] causes increased T can be excluded as it violates the necessary condition for this causality direction.

Unfortunately, these claims are bogus. Here’s why.

The authors estimate impulse response functions between CO2 and temperature (and back), using the following formalism:

where g(h) is the response at lag h. As the authors point out, if

the IRF is zero for every lag except for the specific lag $h_{0}$ , then Equation (1) becomes y(t)=bx(t-h0) +v(t). This special case is equivalent to simply correlating y(t) with x(t-h0) at any time instance $t$ . It is easy to find (cf. linear regression) that in this case the multiplicative constant $b$ is the correlation coefficient of y(t) and x(t-h0) multiplied by the ratio of the standard deviations of the two processes.

Now … anyone who claims to have an “advanced theoretical framework for testing causality” should be aware of the limitations of linear regression. There are several possible issues that might lead to misleading conclusions about causality.

Problem #1 here is bathtub statistics. Temperature integrates the radiative forcing from CO2 (and other things). This is not debatable – it’s physics. It’s old physics, and it’s experimental, not observational. If you question the existence of the effect, you’re basically questioning everything back to the Enlightenment. The implication is that no correlation is expected between CO2 and temperature, because integration breaks pattern matching. The authors purport to avoid integration by using first differences of temperature and CO2. But differencing both sides of the equation doesn’t solve the integration problem; it just kicks the can down the road. If y integrates x, then patterns of the integrals or derivatives of y and x won’t match either. Even worse differencing filters out the signals of interest.

Problem #2 is that the model above assumes only equation error (the term v(t) on the right hand side). In most situations, especially dynamic systems, both the “independent” (a misnomer) and dependent variables are subject to measurement error, and this dilutes the correlation or slope of the regression line (aka attenuation bias), and therefore also the IRF in the authors’ framework. In the case of temperature, the problem is particularly acute, because temperature also integrates internal variability of the climate system (weather) and some of this variability is autocorrelated on long time scales (because for example oceans have long time constants). That means the effective number of data points is a lot less than the 60 years or 720 months you’d expect from simple counting.

Dynamic variables are subject to other pathologies, generally under the heading of endogeneity bias, and related features with similar effects like omitted variable bias. Generalizing the approach to distributed lags in no way mitigates these. The bottom line is that absence of correlation doesn’t prove absence of causation.

Admittedly, even Nobel Prize winners can screw up claims about causality and correlation and estimate dynamic models with inappropriate methods. But causality confusion isn’t really a good way to get into that rarefied company.

I think methods purporting to assess causality exclusively from data are treacherous in general. The authors’ proposed method is provably wrong in some cases, including this one, as is Granger Causality. Even if you have pretty good assumptions, you’ll always find a system that violates them. That’s why it’s so important to take data-driven results with a grain of salt, and look for experimental control (where you can get it) and mechanistic explanations.

One way to tell if you’ve gotten causality wrong is when you “discover” mechanisms that are physically absurd. That happens on a spectacular scale in the third paper:

… we find $Δ R = 23.5$ and 8.1 Gt C/year, respectively, i.e., a total global increase in the respiration rate of $Δ R = 31.6$ Gt C/year. This rate, which is a result of natural processes, is 3.4 times greater than the CO₂ emission by fossil fuel combustion (9.4 Gt C /year including cement production).

To put that in perspective, the authors propose a respiration flow that would put the biosphere about 30% out of balance. This implies a mass flow of trees harvested, soils destroyed, etc. 3.4 times as large as the planetary flow of fossil fuels. That would be about 4 cubic kilometers of wood, for example. In the face of the massive outflow from the biosphere, the 9.4 GtC/yr from fossil fuels went where, exactly? Extraordinary claims require extraordinary evidence, but the authors apparently haven’t pondered how these massive novel flows could be squared with other lines of evidence, like C isotopes, ocean Ph, satellite CO2, and direct estimates of land use emissions.

This “insight” is used to construct a model of the temperature->CO2 process:

In this model, the trend in CO2 is explained almost exclusively by the mean temperature effect mu_v = alpha*(T-T0). That effect is entirely ad hoc, with no basis in the impulse response framework.

How do we get into this pickle? I think the simple answer is that the authors’ specification of the system is incomplete. As above, they define a causal system,

y(t) = ∫g1(h)x(t-h)dh

x(t) = ∫g2(h)y(t-h)dh

where g(.) is an impulse response function weighting lags h and the integral is over h from 0 to infinity (because only nonnegative lags are causal). In their implementation, x and y are first differences, so in their climate example, Δlog(CO2) and ΔTemp. In the estimation of the impulse lag structures g(.), the authors impose nonnegativity and (optionally) smoothness constraints.

A more complete specification is roughly:

Y = A*X + U

dX/dt = B*X + E

where

X is a vector of system states (e.g., CO2 and temperature)
Y is a vector of measurements (observed CO2 and temperature)
A and B are matrices of coefficients (this is a linear view of the system, but could easily be generalized to nonlinear functions)
E is driving noise perturbing the state, and therefore integrated into it
U is measurement error

My notation could be improved to consider covariance and state-dependent noise, though it’s not really necessary here. Fred Schweppe wrote all this out decades ago in Uncertain Dynamic Systems, and you can now find it in many texts like Stengel’s Optimal Control and Estimation. Dixit and Pindyck transplanted it to economics and David Peterson brought it to SD where it found its way into Vensim as the combination of Kalman filtering and optimization.

How does this avoid the pitfalls of the Koutsoyiannis et al. approach?

An element of X can integrate any other element of X, including itself.
There are no arbitrary restrictions (like nonnegativity) on the impulse response function.
The system model (A, B, and any nonlinear elements augmenting the framework) can incorporate a priori structural knowledge (e.g., physics).
Driving noise and measurement error are recognized and can be estimated along with everything else.

Does the difference matter? I’ll leave that for a second post with some examples.

Thyroid Dynamics: Misperceptions of Nonlinearity

I’ve been working on thyroid dynamics, tracking a friend’s data and seeking some understanding with models. With only one patient to go on, it’s impossible to generalize about thyroid behavior from one time series (though there are some interesting features I’ll report on later). On the other hand, our sample size of doctors is now around 10, and I’m starting to see some persistent misperceptions leading to potentially dangerous errors. One of the big issues is simple.

The principal indicator used for thyroid diagnosis is TSH (thyroid stimulating hormone). TSH regulates production of T4 and T3, T3 being the metabolically active hormone. T4 and T3 in turn downregulate TSH (via TRH), producing a negative feedback loop. The basis for the target range for TSH in thyroid treatment is basically the distribution of TSH in the general population without a thyroid diagnosis.

The challenge with TSH is that its response is logarithmic, so its distribution is lognormal. The usual target range is 0.4 to 4 mIU/L (or .45 to 4.5, or something else, depending on which source you prefer). Anyway, suppose you test at 2.2 – bingo! right in the middle! Well, not so fast. The geometric mean of .4 and 4.4 is actually 1.6, so you’re a little high.

How high? Well, no one will tell you without a fight. For some reason, most sources insist on throwing out much of the relevant information about the distribution of “normal”. In fact, when you look at the large survey papers reporting on population health, like NHANES, it’s hard to find the distribution. For example, Thyroid Profile of the Reference United States Population: Data from NHANES 2007-2012 (Jain 2015) doesn’t have a single visualization of the data – just a bunch of tables. When you do find the distribution, you’ll often get a subset (smokers) or a linear-scaled version that makes it hard to see the left tail. (For the record, TSH=2.2 is just above the 75th percentile in the NHANES THYROD_G dataset – so already quite far from the middle.)

There are also more subtle issues. In the first NHANES thyroid survey article, I found Fig. 1:

Here we have a log scale, but bins of convenience. The breakpoints 1, 2, 3, 5, 10… happen to be roughly equally spaced on a log scale. But when you space histogram bins at these intervals, the bin width is very rough indeed – varying by almost a factor of 2. That means the shape of the distribution is distorted. One of the bins expected to be small is the 2.1-3 range, and you can actually see here that those columns look anomalously low, compared to what you’d expect of a nice bell curve.

That was 20 years ago, but things are no better with modern analytics. If you get a blood test now, your results are likely to be reported to you, and your doc, though a system like MyChart. For TSH, this means you’ll get a range plot with a linear scale:

backed up by a time series with a linear scale:

Notice that the “normal” range doesn’t match the ATA recommendation or any other source I’ve seen. Presumably it’s the lab’s +/- 2 standard deviation range or something like that. That’s bad, because the upper limit – 4.82 – is above every clinical association recommendation I’ve seen. The linear scale squashes all the variation around low values, and exaggerates the high ones.

Given that the information systems for the entire thyroid management enterprise offer biased, low-information displays of TSH stats, I think it’s not surprising that physicians have trouble overcoming the nonlinearity bias. They have hundreds of variables to think about, and can’t possibly be expected to maintain a table of log or Z transformations in their heads. Patients are probably even more baffled by the asymmetry.

It would be simple to remedy this by presenting the information in a way that minimizes the cognitive burden on the viewer. Reporting TSH on a log scale is trivial. Reporting percentiles as a complementary statistic would also be trivial, and percentiles are widely used, so you don’t have to explain them. Setting a target value rather than a target range would encourage driving without bouncing off the guardrails. I hope authors and providers can figure this out.