Thyroid Dynamics: Dose Management Challenges

In my last two posts about thyroid dynamics, I described two key features of the information environment that set up a perfect storm for dose management:

  1. The primary indicator of the system state for a hypothyroid patient is TSH, which has a nonlinear (exponential) response to T3 and T4. This means you need to think about TSH on a log scale, but test results are normally presented on a linear scale. Information about the distribution is hard to come by. (I didn’t mention it before, but there’s also an element of the Titanic steering problem, because TSH moves in a direction opposite the dose and T3/T4.)
  2. Measurements of TSH are subject to a rather extreme mix of measurement error and driving noise (probably mostly the latter). Test results are generally presented without any indication of uncertainty, and doctors generally have very few data points to work with.
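To make the log-scale point in item 1 concrete, here is a minimal sketch, assuming TSH falls exponentially as free T4 rises around some reference point. The function name, the reference values, and the sensitivity `k` are all made up for illustration; none of them is fitted to data.

```python
import math

def tsh_from_ft4(ft4, ft4_ref=1.2, tsh_ref=1.5, k=4.0):
    """Assumed log-linear response: TSH shrinks by exp(-k) per unit rise in FT4.

    All default values are hypothetical placeholders.
    """
    return tsh_ref * math.exp(-k * (ft4 - ft4_ref))

# Equal steps in FT4 ...
ft4_levels = [1.0, 1.2, 1.4, 1.6]
tsh_levels = [tsh_from_ft4(x) for x in ft4_levels]

# ... give equal *ratios* in TSH, i.e. equal steps on a log axis,
# and TSH moves in the opposite direction to FT4.
ratios = [tsh_levels[i + 1] / tsh_levels[i] for i in range(len(tsh_levels) - 1)]
print([round(v, 3) for v in tsh_levels], [round(r, 3) for r in ratios])
```

On a linear axis the same FT4 step looks huge at high TSH and negligible at low TSH, even though the underlying shift is identical.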

As if that weren’t enough, the physics of the system is tricky. A change in dose is reflected in T4 and T3, then in TSH, only after a delay. This is a classic “delayed negative feedback loop” situation, much like the EPO-anemia management challenge in the excellent work by Jim Rogers, Ed Gallaher & David Dingli.

If you have a model, like Rogers et al. do, you can make fairly rapid adjustments with confidence. If you don’t, you need to approach the problem like an unfamiliar shower: make small, slow adjustments. If you react too quickly, you’ll excite oscillations. Dose titration guidelines typically reflect this:

Titrate dosage by 12.5 to 25 mcg increments every 4 to 6 weeks, as needed until the patient is euthyroid.

Just how long should you wait before making a move? That’s actually a little hard to work out from the literature. I asked OpenEvidence about this, and the response was typically vague:

The expected time delay between adjusting the thyroid replacement dose and the response of thyroid-stimulating hormone (TSH) is typically around 4 to 6 weeks. This is based on the half-life of levothyroxine (LT4), which reaches steady-state levels by then, and serum TSH, which reaches its nadir at the same time.[1]

The first citation is the ATA guidelines, but when you consult the details, there’s no cited basis for the 4-6 weeks. Presumably this is some kind of 3-tau rule of thumb learned from experience. As an alternative, I tested a dose change in the Eisenberg et al. model:

At the arrow, I double the synthetic T4 dose on a hypothetical person, then observe the TSH trajectory. Normally, you could then estimate the time constant directly from the chart: about 63% of the adjustment is realized at 1*tau, 86% at 2*tau, 95% at 3*tau. If you do that here, tau is about 8 days. But not so fast! TSH responds exponentially, so you need to look at this on a log-y scale:

Looking at this correctly, tau is somewhat longer: about 12-13 days. This is still potentially tricky, because the Eisenberg model is not first order. However, it’s reassuring that I get similar time constants when I estimate my own low-order metamodel.
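The linear-versus-log discrepancy is easy to reproduce with a toy trajectory. This is only a sketch, assuming log(TSH) relaxes first-order toward its new equilibrium. It is not the Eisenberg model, and the start and end TSH values (4 and 1) and the helper names are invented for illustration.

```python
import math

def tsh_trajectory(t, tsh_start, tsh_end, tau):
    """TSH at time t when log(TSH) relaxes first-order toward log(tsh_end)."""
    log_gap = math.log(tsh_start / tsh_end)
    return tsh_end * math.exp(log_gap * math.exp(-t / tau))

def apparent_tau(times, values, transform=lambda v: v):
    """First time at which 1 - 1/e (about 63%) of the total change is realized,
    measured after applying `transform` to the values."""
    y = [transform(v) for v in values]
    y_start, y_end = y[0], y[-1]
    target = 1.0 - math.exp(-1.0)
    for t, yi in zip(times, y):
        if (y_start - yi) / (y_start - y_end) >= target:
            return t
    return None

TRUE_TAU = 12.5                              # days, assumed
times = [i * 0.05 for i in range(4001)]      # 0 to 200 days
tsh = [tsh_trajectory(t, 4.0, 1.0, TRUE_TAU) for t in times]

tau_linear = apparent_tau(times, tsh)              # read off a linear axis
tau_log = apparent_tau(times, tsh, math.log)       # read off a log axis
print(tau_linear, tau_log)
```

With these made-up numbers the linear-axis reading comes out at roughly 8 days while the log-axis reading recovers the 12.5 days that went in. The size of the gap depends on how far TSH moves, so the match with the figures above should not be over-interpreted.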

Taking this result at face value, one could roughly say that TSH is 95% equilibrated to a dose change after about 5 weeks, which corresponds pretty well with the ATA guidelines.
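The arithmetic behind that statement is just the first-order response formula, 1 - exp(-t/tau). A minimal sketch, assuming a single time constant of 12.5 days (the middle of the 12-13 day estimate; the real system is higher order, so this is only approximate):

```python
import math

def fraction_equilibrated(days, tau_days=12.5):
    """Share of the eventual change in log(TSH) realized after `days`,
    assuming a first-order response with time constant tau_days."""
    return 1.0 - math.exp(-days / tau_days)

def days_to_fraction(fraction, tau_days=12.5):
    """Inverse of the above: how long until `fraction` of the change is in."""
    return -tau_days * math.log(1.0 - fraction)

after_2_weeks = fraction_equilibrated(14)   # about two thirds
after_5_weeks = fraction_equilibrated(35)   # a little under 95%
wait_for_95 = days_to_fraction(0.95)        # about 3 tau
print(round(after_2_weeks, 2), round(after_5_weeks, 2), round(wait_for_95, 1))
```

Under this assumption a retest two weeks after a dose change sees only about two thirds of the eventual response, which matters for what follows.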

This is a long setup for … the big mistake. Referring to the lettered episodes on the chart above, here’s what happened.

  • A: Dose is constant at about 200mcg (a little hard to be sure, because it was a mix of 2 products, and the equivalents aren’t well established).
  • B: New doctor orders a test, which comes out very low (.05), out of the recommended range. Given the long term dose-response range, we’d expect about .15 at this dose, so it seems likely that this was a confluence of dose-related factors and noise.
  • C: New doc orders an immediate drastic reduction of dose by 37.5% or 75mcg (3 to 6 times the ATA recommended adjustment).
  • D: Day 14 from dose change, retest is still low (.2). At this point you’d expect that TSH is at most 2/3 equilibrated to the new dose. Over extremely vociferous objections, doc orders another 30% reduction to 88mcg.
  • E: Patient feeling bad, experiencing hair loss and other symptoms. Goes off the reservation and uses remaining 125mcg pills. Coincident test is in range, though one would not expect it to remain so, because the dose changes are not equilibrated.
  • F: Suffering a variety of hypothyroid symptoms at the lower dose.
  • G: Retest after an appropriate long interval is far out of range on the high side (TSH near 7). Doc unresponsive.
  • H: Fired the doc. New doc restores dose to 125mcg immediately.
  • I: After an appropriate interval, retest puts TSH at 3.4, on the high side of the ATA range and above the NACB guideline. Doc adjusts to 175mcg, in part considering symptoms rather than test results.

This is an absolutely classic case of overshooting a goal in a delayed negative feedback system. There are really two problems here: failing to anticipate the delay, and therefore making a second adjustment before the first had stabilized; and making overly aggressive changes, much larger than guidelines recommend.
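The overshoot mechanism can be reproduced in a toy loop. This is a sketch under loud assumptions: log(TSH) follows the dose with a single 12.5-day lag, the dose is scaled so one unit shifts equilibrium log(TSH) by one unit, and the prescriber corrects the full measured error at every test as if the reading were equilibrated. It is not the Eisenberg model or the metamodel discussed in these posts, and all names are hypothetical.

```python
import math

def simulate_titration(retest_interval_days, tau_days=12.5, gain=1.0,
                       horizon_days=365, initial_error=1.0):
    """Toy titration loop; returns the daily TSH error in log units.

    error > 0 means TSH above goal, error < 0 means TSH below goal.
    dose_offset is the dose relative to the (unknown) ideal dose.
    """
    dose_offset = -initial_error          # dose too low, so TSH starts high
    error = initial_error                 # equilibrated to that dose
    decay = math.exp(-1.0 / tau_days)     # one day of first-order relaxation
    history = []
    for day in range(horizon_days):
        if day % retest_interval_days == 0:
            # Test and adjust: raise the dose if TSH is high, cut it if low,
            # treating the reading as if it were fully equilibrated.
            dose_offset += gain * error
        equilibrium = -dose_offset        # higher dose -> lower TSH
        error = equilibrium + (error - equilibrium) * decay
        history.append(error)
    return history

def overshoot(history):
    """Largest excursion below goal (0 if the error never changes sign)."""
    return max(0.0, -min(history))

impatient = overshoot(simulate_titration(14))   # retest every 2 weeks
patient = overshoot(simulate_titration(42))     # retest every 6 weeks
print(round(impatient, 3), round(patient, 3))
```

In this toy setup the two-week schedule drives TSH several times further past the goal than the six-week schedule, purely because each correction is stacked on top of an earlier one that has not finished working.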

So, what’s really going on? I’ve been working with a simplified meta version of the Eisenberg model to figure this out. (The full model is hourly, and therefore impractical to run with Kalman filtering over multi-year horizons. It’s silly to use that much computation on a dozen data points.)

The problem is, the model can’t replicate the data without invoking huge driving noise – there simply isn’t anything in the structure that can account for data points far from the median behavior. I’ve highlighted a few above. At each of these points, the model takes a huge jump, not because of any known dynamics, but because of a filter reset of the model state. This is a strong hint that there’s an unobserved state influencing the system.

If we could get docs to provide a retest at these outlier points, we could at least rule out measurement error, but that has almost never happened. Also, if docs would routinely order a full panel including T3 and T4, not just TSH, we might have a better mechanistic explanation, but that has also been hard to get. Recently, a doc ordered a full panel, but office staff unilaterally reduced the scope to TSH only, because they felt that testing T3 and T4 was “unconventional”. No doubt this is because ATA and some authors have been shouting that TSH is the only metric needed, and the nuances that arise when the evidence contradicts that claim get lost.

For our N=1, the instability of the TSH/T4 relationship contradicts the conventional wisdom, which is that individuals have a stable set point, with the observed high population variation arising from diversity of set points across individuals:

I think the obvious explanation in our N=1 is that some individuals have an unstable set point. You could visualize that in the figure above as moving from one intersection of curves to another. This could arise from a change in the T4->TSH curve (e.g. something upstream of TSH in the hypothalamic-pituitary-thyroid axis) or the TSH->T4 relationship (intermittent secretion or conversion). Unfortunately, very few treatment guidelines recognize this possibility.

Thyroid Dynamics: Chartjunk

I just ran across a funny instance of TSH nonlinearity. Check out the axis on this chart:

It’s not as bad as you’d think: the irregular axis is actually a decent approximation of a log-linear scale:

My main gripe is that the perceptual midpoint of the ATA range bar on the chart is roughly 0.9, whereas the true logarithmic midpoint is more like 1.6. The NACB bar is similarly distorted.
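For reference, the midpoint of a range on a log axis is the geometric mean of its endpoints, not the arithmetic mean. A minimal sketch follows; the 0.5 and 5.0 limits are placeholders chosen only because they put the log midpoint near the 1.6 quoted above, and the chart's actual bar limits should be substituted.

```python
import math

def linear_midpoint(low, high):
    """Midpoint as it would appear on an ordinary linear axis."""
    return (low + high) / 2.0

def log_midpoint(low, high):
    """Midpoint on a log axis: the geometric mean of the endpoints."""
    return math.sqrt(low * high)

# Hypothetical range limits, for illustration only.
range_low, range_high = 0.5, 5.0
mid_linear = linear_midpoint(range_low, range_high)
mid_log = log_midpoint(range_low, range_high)
print(mid_linear, round(mid_log, 2))
```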

Thyroid Dynamics: Noise

A couple weeks ago I wrote about the perceptual challenges of managing thyroid stimulating hormone (TSH), which has an exponential response to the circulating thyroid hormones (T3 & T4) you’d actually like to control.

Another facet of the thyroid control problem is noise. Generally, uncertainty in measurements is not made available to users. For example, the lab results reported by MyChart have no confidence bounds.

If you start looking for information on these tests, you’ll usually find precision estimates that sound pretty good – typically 5 to 7% error. (Example.) However, this understates the severity of the problem.

It’s well known that individual variation in the TSH<->T3,T4 setpoint is large, and the ATA guidelines mention this, if you read the detailed discussion. However, this is presented as a reason for the superiority of TSH measurements: “The logarithmic relationship between TSH and thyroid hormone bestows sensitivity: even if circulating T3 and T4 are in the normal range, it cannot be assumed that the subject is euthyroid. The interindividual ranges for T3 and T4 are much broader than the individual variance, such that measuring T3 and T4 is a suboptimal way to assess thyroid status.” The control implications of variation over time within an individual are not mentioned.

The issue we face in our N=1 sample is unexplained longitudinal variation around the setpoint. In our data, this is HUGE. At a given dose, even during a long period of stability, variation in TSH is not 10%; it’s a factor of 10.

Now consider the problem facing a doc trying to titrate your dose in a 30-minute visit. They tested your TSH, and it’s 4, or .4, right at the high or low end of the recommended range. Should they adjust the dose? (The doc’s problem is actually harder than the data presented above suggests, because they never see this much data – changes in providers, labs and systems truncate the available information to just a few points.) In our experience, 3 out of 5 doctors do change the dose, even though the confidence bounds on these measurements are probably big enough to sail the Exxon Valdez through.
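To get a feel for how wide those bounds might be, here is a crude sketch. It assumes the factor-of-10 longitudinal spread mentioned above is symmetric on a log scale around any single reading, so the band runs from reading/sqrt(10) to reading*sqrt(10). That is a back-of-envelope assumption, not a statistical confidence interval, and the function name is hypothetical.

```python
import math

def multiplicative_band(reading, spread_factor=10.0):
    """Band around a TSH reading whose ends differ by `spread_factor`,
    centered on the reading in log space (an illustrative assumption)."""
    half_width = math.sqrt(spread_factor)
    return reading / half_width, reading * half_width

high_reading_band = multiplicative_band(4.0)   # reading at the top of the range
low_reading_band = multiplicative_band(0.4)    # reading at the bottom
print([round(v, 2) for v in high_reading_band],
      [round(v, 2) for v in low_reading_band])
```

Under that assumption, a reading of 4 and a reading of .4 both sit in bands that reach the middle of the range, so neither one, by itself, says much about which way the dose should move.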

There is at last a paper that tackles this issue:

Individuals exhibit fluctuations in the concentration of serum thyroid-stimulating hormone (TSH) over time. The scale of these variations ranges from minutes to hours, and from months to years. The main factors contributing to the observed within-person fluctuations in serum TSH comprise pulsatile secretion, circadian rhythm, seasonality, and ageing.

I think the right response is actually the byline of this blog: don’t just do something, stand there! If one measurement potentially has enormous variation, the first thing you should probably do is leave the dose alone and retest after a modest time. On several occasions, we have literally begged for such a retest, and been denied.

The consequence of test aversion is that we have only 20 data points over 8 years, and almost none in close proximity to one another. That makes it impossible to determine whether the variation we’re seeing is measurement error (blood draw or lab methods), fast driving noise (circadian effects), or slow trends (e.g., seasonal). I’ve been fitting models to the data for several years, but this sparsity and uncertainty give the model fits. Here’s an example:

At the highlighted point (and half a dozen others), the model finds the data completely inexplicable. The Kalman filter moves the model dramatically towards the data (the downward spike in the red curve), but only about halfway, because the estimate yields both high measurement error and high driving noise in TSH. Because the next measurement doesn’t occur for 4 months, there’s no way to sort out which is which.
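The "about halfway" behavior is what the textbook scalar Kalman update produces when the filter trusts its own prediction and the measurement equally. The sketch below is that generic update applied to log10(TSH), not the filter actually used for these fits; the prediction, the reading, and both variances are invented numbers.

```python
import math

def kalman_update(predicted, predicted_var, measured, measurement_var):
    """Standard scalar Kalman measurement update.

    Returns (updated estimate, updated variance, gain).
    """
    gain = predicted_var / (predicted_var + measurement_var)
    updated = predicted + gain * (measured - predicted)
    updated_var = (1.0 - gain) * predicted_var
    return updated, updated_var, gain

# Hypothetical case: the model predicts TSH = 1.0, the lab reports 0.1.
predicted_log = math.log10(1.0)
measured_log = math.log10(0.1)

# Equal variances: driving noise has inflated the prediction uncertainty
# to the same size as the assumed measurement error.
updated_log, updated_var, gain = kalman_update(
    predicted_log, 0.25, measured_log, 0.25)
updated_tsh = 10 ** updated_log
print(gain, round(updated_tsh, 3))
```

With equal variances the gain is one half, so the estimate lands midway between prediction and data in log space. Only another measurement soon afterward could shift that balance one way or the other.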

This extreme noise, plus nonlinearity previously mentioned, is really a perfect setup for errors in dose management. I’ll describe one or two in a future post.

The Blood-Hungry Spleen

OK, I’ve stolen another title, this time from a favorite kids’ book. This post is really about the thyroid, which is a little less catchy than the spleen.

Your hormones are exciting!
They stir your body up.
They’re made by glands (called endocrine)
and give your body pluck.

Allan Wolf & Greg Clarke, The Blood-Hungry Spleen

A friend has been diagnosed with hypothyroidism, so I did some digging on the workings of the thyroid. A few hours searching citations on PubMed, Medline and Google gave me enough material to create this diagram:

Thyroid function and some associated feedbacks

(This is a LARGE image, so click through and zoom in to do it justice.)

The bottom half is the thyroid control system, as it is typically described. The top half strays into the insulin regulation system (borrowed from a classic SD model), body fat regulation, and other areas that seem related. A lot of the causal links above are speculative, and I have little hope of turning the diagram into a running model. Unfortunately, I can’t find anything in the literature that really digs into the dynamics of the system. In fact, I can’t even find the basics – how much stuff is in each stock, and how long does it stay there? There is a core of the system that I hope to get running at some point though:

Thyroid - core regulation and dose titration

(another largish image)

This is the part of the system that’s typically involved in the treatment of hypothyroidism with synthetic hormone replacements. Normally, the body runs a negative feedback loop in which thyroid hormone levels (T3/T4) govern production of TSH, which in turn controls the production of T3 and T4. The problem begins when something (perhaps an autoimmune disease, i.e. Hashimoto’s) diminishes the thyroid’s ability to produce T3 and T4 (reducing the two inflows in the big yellow box at center). Then therapy seeks to replace the natural control loop, by adjusting a dose of synthetic T4 (levothyroxine) until the measured level of TSH (left stock structure) reaches a desired target.

This is a negative feedback loop with fairly long delays, so dosage adjustments are made only at infrequent intervals, in order to allow the system to settle between changes. Otherwise, you’d have the aggressive shower taker problem: water’s too cold, crank up the hot water … ouch, too hot, turn it way down … eek, too cold …. Measurements of T3 and T4 are made, but seldom paid much heed – the TSH level is regarded as the “gold standard.”

This black box approach to control is probably effective for many patients, but it leaves me feeling uneasy about several things. The “normal” range for TSH varies by an order of magnitude; what basis is there for choosing one or the other end of the range as a target? Wouldn’t we expect variation among patients in the appropriate target level? How do we know that TSH levels are a reliable indicator, if they don’t correlate well with T3/T4 levels or symptoms? Are extremely sparse measurements of TSH really robust to variability on various time scales, or is dose titration vulnerable to noise?

One could imagine alternative approaches to control, using direct measurements of T3 and T4, or indirect measurements (symptoms). Those might have the advantage of less delay (fewer confounding states between the goal state and the measured state). But T3/T4 measurements seem to be regarded as unreliable, which might have something to do with the fact that it’s hard to find any information on the scale or dynamics of their reservoirs. Symptoms also take a back seat; one paper even demonstrates fairly convincingly that dosage changes +/- 25% have no effect on symptoms (so why are we doing this again?).

I’d like to have a more systemic understanding of both the internal dynamics of the thyroid regulation system, and its interaction with symptoms, behaviors, and other regulatory systems. Here’s hoping that one of you lurkers (I know you’re out there) can comment with some thoughts or references.

So the spleen doesn’t feel shortchanged, I’ll leave you with another favorite:

I think that I ain’t never seen
A poem ugly as a spleen.
A poem that could make you shiver
Like 3.5 … pounds of liver.
A poem to make you lose your lunch,
Tie your intestines in a bunch.
A poem all gray, wet, and swollen,
Like a stomach or a colon.
Something like your kidney, lung,
Pancreas, bladder, even tongue.
Why you turning green, good buddy?
It’s just human body study.

Jon Scieszka & Lane Smith, Science Verse