ChatGPT plays the Beer Game

Well, sort of. I decided to use the Beer Game inventory management framework as a test of reasoning in a dynamic situation. I changed “beer” to “televisions” for easy typing, and to avoid triggering any safety provisions on alcohol, but otherwise I tried to stay close to the classic game.

You

I run a retail store. Most weeks, we sell 4 televisions. We typically have 8 televisions in inventory. Each week, we order more televisions to restock the ones sold and maintain our inventory. It takes 2 weeks to receive a television we order. It costs $1 to keep a television in stock for a week, but the opportunity cost of lost sales if no televisions are in stock is $2. This week we sold 8 televisions. How many should we order this week?

ChatGPT

To determine how many televisions you should order this week, let’s break down the process step by step:

1. *Current Inventory:* You have 8 televisions in inventory.

2. *Demand:* This week, you sold 8 televisions.

3. *Lead Time:* It takes 2 weeks to receive a television after ordering.

4. *Costs:*

* Holding Cost: $1 per television per week.
* Stockout Cost: $2 per television per week.

5. *Objective:* Minimize the total cost, considering both holding and stockout costs.

Given that it takes 2 weeks for new stock to arrive, you need to cover the demand for the next 2 weeks.

Let’s calculate:

* *Demand for the next 2 weeks:* 8 televisions per week * 2 weeks = 16 televisions

Now, let’s calculate the total cost for different ordering quantities:

*Scenario 1:* Ordering 16 televisions

* Holding cost for 2 weeks: $1 * 16 televisions * 2 weeks = $32
* No stockout cost
* Total Cost: $32

*Scenario 2:* Ordering 17 televisions

* Holding cost for 2 weeks: $1 * 17 televisions * 2 weeks = $34
* Stockout cost for 1 television for 1 week: $2 * 1 television * 1 week = $2
* Total Cost: $34 + $2 = $36

*Scenario 3:* Ordering 18 televisions

* Holding cost for 2 weeks: $1 * 18 televisions * 2 weeks = $36
* Stockout cost for 2 televisions for 1 week: $2 * 2 televisions * 1 week = $4
* Total Cost: $36 + $4 = $40

Based on these calculations, ordering 16 televisions would minimize your total cost. Therefore, you should order 16 televisions this week.

Verdict: The problem statement is OK, but GPT seems to have taken this as a one-shot static situation. It appears to be neglecting the possibility of reordering in following weeks, and therefore trying to cover the entire shipping pipeline with the order. Also, it’s misunderstood the stockout cost, charging stockout penalties to the scenarios that order more. In any case, it’s overreacting.
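For reference, here’s a minimal sketch (mine, not the transcript’s) of the weekly loop GPT is missing: inventory is a stock, orders arrive after the 2-week lead time, and you can reorder every week. I assume the 8-unit week was a blip and demand reverts to 4, and I ignore, as the conversation does at this point, any units already in transit.

```python
# Hedged sketch: compare first-week order sizes under weekly reordering.
# Costs per the problem: $1/unit/week holding, $2/unit/week of lost sales.

def total_cost(first_order, weeks=8, demand=4, inventory=8, lead=2):
    pipeline = [0] * lead                 # units in transit; [0] arrives next week
    cost = 0
    for week in range(weeks):
        inventory += pipeline.pop(0)      # receive this week's delivery
        sold = min(demand, inventory)
        inventory -= sold
        cost += 1 * inventory + 2 * (demand - sold)   # holding + stockout
        order = first_order if week == 0 else demand  # afterwards, just replace sales
        pipeline.append(order)
    return cost

for q in (8, 12, 16):
    print(q, total_cost(q))               # 8 -> 28, 12 -> 52, 16 -> 76
```

Under these assumptions, simply replacing sales dominates: a larger first order never prevents a stockout, it just adds holding cost, which is the sense in which 16 is an overreaction.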

You

16 televisions sounds like a lot. What if I just replace the 8 sold by ordering 8 televisions? Or should I order more than 8 in order to replace depleted inventory?

ChatGPT

Ordering just 8 televisions to replace the ones sold might not be the best strategy because it doesn’t account for potential stockouts and the time it takes for new stock to arrive.

Let’s analyze the scenario if you only order 8 televisions:

1. *Demand:* You sold 8 televisions this week.
2. *Lead Time:* It takes 2 weeks to receive new televisions after ordering.
3. *Inventory Depletion:* If you only order 8 televisions, you won’t have enough stock to cover demand for the next 2 weeks.
4. *Potential Stockouts:* There’s a risk of running out of stock, leading to lost sales and stockout costs.

Given these factors, if you only order 8 televisions, you’ll likely face stockouts and incur additional costs. It’s better to order enough to cover the demand for the next 2 weeks and maintain a buffer to minimize the risk of stockouts.

Ordering 16 televisions allows you to cover the demand for the next 2 weeks and maintain your inventory level, considering the lead time for new stock to arrive. It helps minimize the total cost by balancing holding costs and potential stockout costs.

Therefore, ordering 16 televisions would be a more prudent approach to ensure you meet customer demand while minimizing costs in the long run.

Verdict: I’m not quite sure what it’s “thinking” here. For one thing, what I order this week will have no effect on stock for the next two weeks, due to the shipping delay. So if I’m going to be out of stock next week, that’s inevitable. I’ll try to help it out …

You

I forgot to mention that previously we have been ordering 4 televisions per week to replace those sold. That means there are already 8 televisions being shipped to us – 4 arriving next week, and 4 the week after. Does that change the answer?

ChatGPT

Thanks for the clarification. Considering that you have 4 televisions already on order for each of the next 2 weeks, that changes the situation. Let’s adjust the analysis:

1. *Current Inventory:* You have 8 televisions in inventory.
2. *Demand:* You sold 8 televisions this week.
3. *Lead Time:* You have 4 televisions arriving next week and another 4 arriving the week after.
4. *Total Demand to Cover:* 8 televisions sold this week + 8 televisions to cover the next 2 weeks = 16 televisions.
5. *Costs:* Holding cost is $1 per television per week, and the stockout cost is $2 per television per week.

With 16 televisions to cover the next 2 weeks’ demand, you have a couple of options:

*Ordering 8 televisions:* This would cover the immediate demand for the next week, but you would still have to cover the following week’s demand with the existing order. However, you risk stockouts for the second week’s demand.

*Ordering 16 televisions:* This would cover both weeks’ demand without any risk of stockouts. You would not need to hold excess inventory beyond what’s necessary for the next 2 weeks.

Given the potential stockout costs and the need to maintain inventory to cover future demand, ordering 16 televisions seems like the safer option. It ensures that you can meet customer demand for the next 2 weeks without incurring additional costs due to stockouts.

Verdict: I think it still doesn’t get the sequencing of orders and shipments, probably because it’s not really managing with an understanding of bathtub dynamics. I’ll see if it does better on the big picture:
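To make the sequencing concrete, here’s the bathtub arithmetic for the corrected situation (my sketch, again assuming demand reverts to 4/week). Nothing ordered now can arrive before week 3, so the next two weeks are already determined:

```python
# Stock = stock + arrivals - sales, week by week. Arrivals are fixed at 4
# in week 1 and 4 in week 2; this week's order lands in week 3 (2-week lead).

def stock_path(order_now, inventory=8, demand=4):
    path = [inventory]
    for arrival in (4, 4, order_now):
        inventory += arrival - demand
        path.append(inventory)
    return path

print(stock_path(4))    # [8, 8, 8, 8]   ordering 4 holds inventory steady
print(stock_path(16))   # [8, 8, 8, 20]  weeks 1-2 identical; week 3 balloons
```

Ordering 16 can’t avert a near-term stockout that’s already baked in; it can only pile up inventory when the order finally lands.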

You

I compromised and ordered 12 televisions this week. However, the supplier informed me that they are experiencing stockouts, and it might take 4 weeks to receive televisions in the future. How should I respond by changing my ordering next week?

ChatGPT

Given the new information about the supplier experiencing stockouts and the potential increase in lead time to 4 weeks, you’ll need to adjust your ordering strategy accordingly to mitigate the risk of running out of stock.

Here’s how you can respond by changing your ordering next week:

1. *Assess Current Inventory:* Evaluate how many televisions you have in stock after accounting for this week’s sales and the 12 televisions you ordered.

2. *Estimate Future Demand:* Consider the expected demand for the next 4 weeks based on historical sales data and any other relevant factors.

3. *Calculate Total Demand:* Add up the expected demand for each of the next 4 weeks.

4. *Account for Lead Time:* Since the lead time has increased to 4 weeks, you’ll need to order enough televisions to cover the expected demand during this extended lead time period.

5. *Consider Safety Stock:* Given the uncertainty with the supplier’s stockouts, you may want to increase your safety stock level to buffer against any unexpected fluctuations in demand or delays in delivery.

6. *Place Order:* Based on your assessment of current inventory, future demand, and the need for safety stock, place an order with the supplier for the necessary quantity of televisions to cover the next 4 weeks.

7. *Monitor and Adjust:* Continuously monitor inventory levels, sales data, and any updates from the supplier regarding lead times. Be prepared to adjust your ordering strategy as needed to maintain optimal inventory levels and minimize the risk of stockouts.

Considering the potential for longer lead times and the need to maintain adequate inventory levels during this period of uncertainty, you may want to order a higher quantity of televisions next week to ensure that you can meet customer demand and avoid stockouts.

Verdict: It’s retreated to a list of considerations, rather than proposing specific order numbers, which is probably good. It does mention “lead time” but it doesn’t quite come out and say “keep track of the product in the order pipeline,” which is a key omission. Also, it’s arbitrarily looking at a 4-week horizon.
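The standard fix for that omission is the stock-management heuristic from the Beer Game literature: anchor on expected demand, then adjust for both the inventory gap and the supply-line gap. A sketch, with adjustment fractions and a desired inventory that are my own illustrative choices:

```python
def order_decision(expected_demand, inventory, supply_line,
                   desired_inventory=8, lead_time=4,
                   alpha_i=0.25, alpha_sl=0.25):
    # Desired supply line covers expected demand over the (now 4-week) lead time.
    desired_supply_line = expected_demand * lead_time
    adjustment = (alpha_i * (desired_inventory - inventory)
                  + alpha_sl * (desired_supply_line - supply_line))
    return max(0.0, expected_demand + adjustment)

# 12 just ordered plus 8 already in transit = 20 in the pipeline:
print(order_decision(expected_demand=4, inventory=8, supply_line=20))  # 3.0
```

Setting alpha_sl to zero, i.e. ignoring the supply line, is exactly what generates the bullwhip: every week you reorder for inventory you’ve already ordered.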

You

I’ve been talking to some other retail stores, and it turns out that they’re all ordering extra televisions to build up inventory to offset the longer shipping lead times and supplier stockouts. Should I be concerned about this?

ChatGPT

It’s understandable that other retail stores are adjusting their ordering strategies to build up inventory in response to longer shipping lead times and supplier stockouts. However, whether you should be concerned depends on various factors specific to your own retail store, including your sales patterns, storage capacity, financial situation, and the level of risk you’re comfortable with.

Here are some considerations to help you evaluate whether you should be concerned:

1. *Sales Patterns:* Analyze your historical sales data to understand your typical demand patterns. If your sales are relatively stable and predictable, you may not need to build up as much inventory compared to stores with more fluctuating demand.

2. *Storage Capacity:* Consider whether you have sufficient storage space to accommodate extra inventory. Building up inventory can tie up valuable space and increase holding costs if you’re unable to sell the televisions quickly.

3. *Financial Situation:* Evaluate your cash flow and financial resources. Building up inventory requires additional investment, so ensure that you have the financial capacity to handle the increased inventory levels without negatively impacting your cash flow or liquidity.

4. *Supplier Relationships:* Assess the reliability of your supplier and the likelihood of continued stockouts or longer lead times. If you have concerns about your supplier’s ability to fulfill orders consistently, it may be prudent to build up inventory as a precautionary measure.

5. *Competitive Landscape:* Consider how your competitors’ actions might impact your business. If other retail stores are building up inventory to meet customer demand more effectively, you may need to adjust your strategy to remain competitive.

Ultimately, the decision to build up inventory should be based on a careful assessment of your unique circumstances and risk tolerance. While it’s important to be aware of industry trends and competitive dynamics, your focus should be on optimizing inventory levels to meet customer demand efficiently while minimizing costs and risks to your business. If you’re unsure about the best course of action, consulting with a financial advisor or industry expert may provide valuable insights tailored to your specific situation.

Verdict: The list of considerations 1-5 is not bad, but it’s not really on point to the important question, which is whether the bullwhip effect may come around to bite us.

You

Should I be concerned about the possibility that the current supply shortage will eventually be replaced by a supply glut, when manufacturer capacity catches up with these orders? I don’t want to get stuck with a lot of televisions I can’t sell, because suppliers mistook strategic phantom ordering for real demand.

ChatGPT

Concerns about the possibility of a supply glut once manufacturer capacity catches up with current orders are valid, especially if there’s been a surge in orders due to strategic stockpiling by retailers. Being stuck with excess inventory can tie up capital, lead to storage issues, and potentially result in markdowns or losses if the televisions can’t be sold at full price.

To address this concern, consider the following strategies:

1. *Monitor Industry Trends:* Stay informed about industry trends, including updates on manufacturer capacity, production levels, and market demand. Keep track of news and reports from reliable sources to anticipate potential shifts in supply and demand dynamics.

2. *Maintain Flexibility:* Maintain flexibility in your ordering strategy to adapt to changing market conditions. Avoid overcommitting to large orders or long-term contracts that could leave you vulnerable to a sudden downturn in demand or an oversupply situation.

3. *Diversify Product Range:* Diversify your product range to reduce reliance on any single category or supplier. By offering a broader selection of products, you can mitigate the impact of supply fluctuations in any particular category and capitalize on emerging trends.

4. *Strengthen Supplier Relationships:* Cultivate strong relationships with your suppliers to facilitate open communication and collaboration. Work closely with your suppliers to share information about inventory levels, sales forecasts, and market trends, enabling them to better anticipate your needs and adjust production accordingly.

5. *Implement Just-in-Time Inventory:* Consider implementing a just-in-time inventory system to minimize excess inventory and optimize stock levels based on real-time demand signals. This approach can help you reduce carrying costs and mitigate the risk of overstocking.

6. *Monitor Inventory Levels:* Continuously monitor your inventory levels and sales performance to identify any signs of excess inventory or slow-moving products. Take proactive measures, such as offering promotions or discounts, to clear out excess inventory before it becomes a liability.

By staying vigilant, maintaining flexibility, and fostering strong relationships with suppliers, you can mitigate the risks associated with potential supply gluts and position your retail store to adapt effectively to changing market conditions.

Verdict: Again, this is a good list, and several of these items might be useful strategic advice for the supply glut problem. #6 mentions inventory monitoring, which is good, but there’s nothing about supply line monitoring, which is bad, though improving communication with suppliers (#4) and switching to JIT (#5) might help. Still, I feel it’s wise to heed the statement at the bottom of every page:

ChatGPT can make mistakes. Check important info.

AI for modeling – what (not) to do

Ali Akhavan and Mohammad Jalali have a nice new article in the SDR on the use of AI (LLMs) to complement simulation modeling.

Generative AI and simulation modeling: how should you (not) use large language models like ChatGPT

Ali Akhavan, Mohammad S. Jalali

Abstract

Generative Artificial Intelligence (AI) tools, such as Large Language Models (LLMs) and chatbots like ChatGPT, hold promise for advancing simulation modeling. Despite their growing prominence and associated debates, there remains a gap in comprehending the potential of generative AI in this field and a lack of guidelines for its effective deployment. This article endeavors to bridge these gaps. We discuss the applications of ChatGPT through an example of modeling COVID-19’s impact on economic growth in the United States. However, our guidelines are generic and can be applied to a broader range of generative AI tools. Our work presents a systematic approach for integrating generative AI across the simulation research continuum, from problem articulation to insight derivation and documentation, independent of the specific simulation modeling method. We emphasize that while these tools offer enhancements in refining ideas and expediting processes, they should complement rather than replace critical thinking inherent to research.

It’s loaded with useful examples of prompts and responses.

I haven’t really digested this yet, but I’m looking forward to writing about it. In the meantime, I’m very interested to hear your take in the comments.

How Beauty Dies

I’m lucky to live in a beautiful place, with lots of wildlife, open spaces, relative quiet and dark skies, and clean water. Keeping it that way is expensive. The cost is not money; it’s the time it takes to keep people from loving it to death, or simply exploiting it until the essence is lost.

Fifty years ago, some far-sighted residents realized that development would eventually ruin the area, so they created conservation zoning designed to limit density and preserve natural resources. For a long time, that structure attracted people who were more interested in beauty than money, and the rules enjoyed strong support.

However, there are some side effects. First, the preservation of beauty in a locale, when everywhere else turns to burbs and billboards, raises values. Second, the low density needed to preserve resources creates scarcity, again raising property values. High valuations attract people who come primarily for the money, not for the beauty itself.

Every person who moves in exploits a little or a lot of the remaining resources, with the money people leaning towards extraction of as much as possible. As the remaining beauty degrades, fewer people are attracted for the beauty, and the nature of the place becomes more commercial.

A reinforcing feedback drives the system to a tipping point, when the money people outnumber the beauty people. Then they erode the regulations, and paradise is lost to a development free-for-all.
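For the morbidly curious, here’s a toy rendering of the loop. Every number is invented for illustration; the qualitative behavior, reinforcing growth of the money contingent until it overtakes, is the point:

```python
beauty = 1.0                                       # amenity index, 1 = pristine
beauty_people, money_people = 90.0, 10.0
for year in range(100):
    beauty_people *= 1 + 0.05 * beauty             # drawn by the amenity itself
    money_people *= 1.05 + 0.05 * (1 - beauty)     # drawn by price appreciation
    beauty = max(0.0, beauty - 0.0005 * beauty_people - 0.002 * money_people)
    if money_people > beauty_people:
        print(f"year {year}: money people outnumber beauty people")
        break
```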

Defining SD

Open Access Note by Asmeret Naugle, Saeed Langarudi, Timothy Clancy: https://doi.org/10.1002/sdr.1762

Abstract
A clear definition of system dynamics modeling can provide shared understanding and clarify the impact of the field. We introduce a set of characteristics that define quantitative system dynamics, selected to capture core philosophy, describe theoretical and practical principles, and apply to historical work but be flexible enough to remain relevant as the field progresses. The defining characteristics are: (1) models are based on causal feedback structure, (2) accumulations and delays are foundational, (3) models are equation-based, (4) concept of time is continuous, and (5) analysis focuses on feedback dynamics. We discuss the implications of these principles and use them to identify research opportunities in which the system dynamics field can advance. These research opportunities include causality, disaggregation, data science and AI, and contributing to scientific advancement. Progress in these areas has the potential to improve both the science and practice of system dynamics.

I shared some earlier thoughts here, but my refined view is in the SDR now:


Invited Commentaries by Tom Fiddaman, Josephine Kaviti Musango, Markus Schwaninger, Miriam Spano: https://doi.org/10.1002/sdr.1763

More reasons to love emissions pricing

I was flipping through a recent Tech Review, and it seemed like every other article was an unwitting argument for emissions pricing. Two examples:

Job title of the future: carbon accountant

We need carbon engineers who know how to make emissions go away more than we need bean counters to tally them. Are we also going to have nitrogen accountants, and PFAS accountants, and embodied methane in iridium accountants, and … ? That way lies insanity.

The fact is, if carbon had a nontrivial price attached at the wellhead, it would pervade the economy, and we’d already have carbon accountants. They’re called accountants.

More importantly, behind those accountants is an entire infrastructure of payment systems that enforces conservation of money. You can only cheat an accounting system for so long, before the cash runs out. We can’t possibly construct parallel systems providing the same robustness for every externality we’re interested in.

Here’s what we know about lab-grown meat and climate change

Realistically, no matter how hard we try to work out the relative emissions of natural and laboratory cows, the confidence bounds on the answer will remain wide until the technology is used at scale.

We can’t guide that scaling process by assessments that are already out of date when they’re published. Lab meat innovators need a landscape in which carbon is priced into their inputs, so they can make the right choices along the way.

Model quality: draining the swamp for large models

In my high road post a while ago, I advocated “voluntary simplicity” as a way of avoiding a large model with an insurmountable burden of undiscovered rework.

Sometimes this is not a choice, because you’re asked to repurpose a large model for some related question. Maybe you didn’t build it, or maybe you did, but it’s time to pause and reflect before proceeding. It’s critical to determine where on the spectrum above the model lies – between the Vortex of Confusion and Nirvana.

I will assert from experience that a large model was almost always built with some narrow questions in mind, and that it was exercised mainly over parts of its state space relevant to those questions. It’s quite likely that a lot of bad behaviors lurk in other parts of the state space. When you repurpose the model those things are going to bite you.

If you start down the red road, “we’ll just add a little stuff for the new question…” you will find yourself in a world of hurt later. It’s absolutely essential that you first do some rigorous testing to establish what the limitations of the model might be in the new context.

The reason scope is so dangerous is that its effect on your ability to make quality progress is nonlinear. The number of interactions you have to manage, and therefore the opportunities for errors and the complexity of corrections, grows with the square of scope. The speed of your model falls in proportion to 1/scope, as does the time you can spend attending to each variable.

My tentative recipe for success, or at least survival:

1. Start early. Errors beget more errors, so the sooner you discover them, the sooner you can arrest that vicious cycle.

2. Be ruthless. Don’t test to see if the model can answer the new question; test to see if you can break it and get nonsense answers.

3. Use your tools. Pay attention to unit errors and runtime warnings. Write Reality Checks to automate tests. Set ranges on key variables to ensure that they’re within reason.

4. Isolate. Because of the nonlinear interaction problem, it’s hard to interpret tests on a full model. Instead, extract components and test them in isolation. You can do this by copy-pasting, or even easier in Vensim, by using Synthesim Overrides to modify inputs to steps, ramps, etc.

5. Don’t let go. When you find a problem, track it back to its root cause.

6. Document. Keep a lab notebook, or an email stream, or a todo list, so you don’t lose the thought when you have multiple issues to chase.

7. Be extreme. Pick a stock and kick it with a pulse or an override. Take all the people out of the factory, or all the ships out of the fleet. What happens? Does anything go negative? Do decisions remain consistent with goals? (A sketch of automating this appears after the list.)

8. Calibrate. Calibration against data can be a useful way to find issues, but it’s much slower than other options, so this is something to pursue late in the process. Also, remember that model-data gaps are equally likely to reveal a problem with the data.

9. Rebuild. If you’re finding a lot of problems, you may be better off starting clean, using the existing model as a conceptual guide, but reconsidering the detailed design of the implementation.

10. Submodel. It’s often hard to fix something inside the full plate of spaghetti. But you may be able to identify a solution in an external submodel, free of distractions, and then transplant it back into the host.

11. Reduce. If you can’t rebuild the full scope within available resources, cut things out. This may not be appetizing to the client, but it’s certainly better than delivering a fragile model that only works if you don’t touch it.

12. If you find you’re in a hole, stop digging. Don’t add any features until you have things under control, because they’ll only exacerbate the problems.

13. Communicate. Let the client, and your team, know what you’re up to, and why quality is more important than their cherished kitchen sink.
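For those testing outside Vensim, a bare-bones harness for items 2 and 7 might look like the following. The run_model interface and the stock names are hypothetical stand-ins for whatever your tooling exposes:

```python
def extreme_conditions_suite(run_model, stocks, outputs):
    """Kick each stock to zero and flag nonsense in the outputs."""
    failures = []
    for stock in stocks:
        result = run_model(overrides={stock: 0.0})   # hypothetical interface
        for name in outputs:
            if min(result[name]) < 0:
                failures.append(f"{name} goes negative when {stock} = 0")
    return failures

# e.g. extreme_conditions_suite(my_model, ["Workforce", "Fleet"], ["Production"])
```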

Integration & Correlation

Claims by AI chatbots, engineers and Nobel prize winners notwithstanding, absence of correlation does not prove absence of causation, any more than presence of correlation proves presence of causation. Bard outlines several reasons, from noise to nonlinearity, but misses a key one: bathtub statistics.

Here’s a really simple example of how this reasoning can go wrong. Consider a system with a stock Y(t) that integrates a flow X(t):

X(t) = -t

Y(t) = ∫X(t)dt

We don’t need to simulate to solve for Y(t) = -1/2*t^2 + C.

Over the interval t=[-1,1] the X and Y time series look like this:

The X-Y relationship is parabolic, with correlation zero:

Zero correlation can’t mean “not causal” because we constructed the system to be causal. Even worse, the sign of the relationship depends on the subset of the interval you examine:
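This is easy to verify numerically; a quick sketch:

```python
import numpy as np

t = np.linspace(-1, 1, 2001)
x = -t               # the flow X(t)
y = -0.5 * t**2      # the stock Y(t), taking C = 0

print(np.corrcoef(x, y)[0, 1])                   # ~0 over all of [-1, 1]
print(np.corrcoef(x[t <= 0], y[t <= 0])[0, 1])   # negative on [-1, 0]
print(np.corrcoef(x[t >= 0], y[t >= 0])[0, 1])   # positive on [0, 1]
```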


This is not the only puzzling case. Consider instead:

X(t) = 1

Y(t) = ∫X(t)dt = t + C

In this case, X(t) has zero variance. But Corr(X,Y) = Cov(X,Y)/(σ(X)σ(Y)), which is 0/0. What are we to make of that?

This pathology can also arise from feedback. Consider a thermostat that controls a heater that operates in two states (on or off). If the heater is fast, and the thermostat is sensitive with a narrow temperature band, then σ(temperature) will be near 0, even though the heater is cycling with σ(heater state)>0.
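Numerically, the 0/0 shows up as NaN, and the thermostat case is easy to mock up (all parameters arbitrary):

```python
import numpy as np

# Constant flow: Corr(X, Y) = 0/0, which numpy reports as nan.
x = np.ones(100)
print(np.corrcoef(x, np.cumsum(x))[0, 1])        # nan (zero variance in x)

# Bang-bang thermostat: fast heater, narrow band, steady heat loss.
dt, temp, heater = 0.01, 20.0, 0
temps, states = [], []
for _ in range(10_000):
    if temp < 19.95:
        heater = 1
    elif temp > 20.05:
        heater = 0
    temp += dt * (5.0 * heater - 2.5)            # heat in minus heat out
    temps.append(temp)
    states.append(heater)
print(np.std(temps), np.std(states))             # ~0.04 vs 0.5
```

Here the correlation tells you almost nothing, even though the causal chain from heater to temperature is as tight as it gets.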

AI Chatbots on Causality

Having recently encountered some major causality train wrecks, I got curious about LLM “understanding” of causality. If AI chatbots are trained on the web corpus, and the web doesn’t “get” causality, there’s no reason to think that AI will make sense either.

TL;DR: ChatGPT and Bing utterly fail this test, for reasons that are evident in Google Bard’s surprisingly smart answer.

ChatGPT: FAIL

Bing: FAIL

Google Bard: PASS

Google gets strong marks for mentioning a bunch of reasons to expect that we might not find a correlation, even though x is known to cause y. I’d probably only give it a B+, because it neglected integration and feedback, but it’s a good answer that properly raises lots of doubts about simplistic views of causality.

Thyroid Dynamics: Hyper Resurgence

In my last thyroid post, I described a classic case of overshoot due to failure to account for delays. I forgot to mention the trigger for this episode.

At point B above, there was a low TSH measurement, at .05, well below the recommended floor of .4. That was taken as a signal for a dose reduction, which is qualitatively reasonable.

Let’s suppose we believe the dose-TSH response to be stable:

Then we have some puzzles. First, at 200mcg, we’d expect TSH=.15, about 3x higher. To get the observed measurement, we’d expect the dose to be more like 225mcg. Second, exactly the same reading of .05 has been observed at a much lower dose (162.5mcg), which is in the sweet spot (yellow box) we should be targeting. Third, also within that sweet spot, at 150mcg, we’ve seen TSH as high as 15 – far out of range in the opposite direction.

I think an obvious conclusion is that noise in the system is extreme, so there’s good reason to respond by discounting the measurement and retesting. But that’s not what happens in general. Here’s a plot of TSH observations (x, log scale) against subsequent dose adjustments (y, %):

There are three clusters of note.

  • The yellow-highlighted points are low TSH values that were followed by large dose reductions, exceeding guidelines.
  • The green points are the large dose increases needed to reverse the yellow reductions, once they subsequently proved to be errors.
  • The purple points (3 total) are high TSH readings, right at the top of the recommended range, that did not induce a dose increase, even though they were accompanied by symptom complaints.

This is interesting, because the trendline seems to indicate a reasonable, if noisy, strategy of targeting TSH=1. But the operative decision rule for several of the doctors involved seems to be more like:

  • If you get a TSH measurement at the high end of the range, indicating a dose increase might be appropriate, ignore it.
  • If you get a low TSH measurement, PANIC. Cut dose drastically.
  • If you’re the doctor replacing the one just fired for screwing this up, restore the status quo.

Why is this? I think it’s an error in reasoning. Low TSH could be caused by excessive T4 levels, which could arise from (a) overtreatment of a hypothyroid patient, or (b) hyperthyroid activity in a previously hypothyroid patient. In the case described previously, evidence from T4 testing as well as the long term relationship suggested that the dose was 20-30% high, but it was ultimately reduced by 60%. But in two other cases, there was no T4 confirmation, and the dose was right in the middle of its apparent sweet spot. That rules out overtreatment, so the mental model behind a dose reduction has to be (b). But that makes no sense. It’s a negative feedback system, yet somehow the thyroid has increased its activity, in response to a reduction in the hormone that normally signals it to do so? Admittedly, there are possibilities like cancer that could explain such behavior, but no one has ever explored that possibility in N=1’s case.

I think the basic problem here is that it’s hard to keep a mechanistic model of a complex hormone signalling system in your head, which makes it easy to get fooled by delays, feedback, noise and nonlinearity. Bad information systems and TSH monomania contribute to the problem, as does ignoring dose guidelines due to overconfidence.

So what should happen in response to a low TSH measurement in patient N=1? I think it’s more like the following:

  • Don’t panic.
    • It might be a bad measurement (labs don’t correct for seasonality, time of day, and other features that could inflate variance beyond the precision of the test itself).
    • It might be some unknown source of variability driving TSH, like food, medications, or endogenous variation in upstream hormones.
  • Look at the measurement in context of other information: the past dose-response relationship, T4, and symptoms, and reference dose per unit body mass.
  • Make at most a small move, wait for the guideline-prescribed period, and retest. (A rough coding of this policy follows.)
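Here’s that policy as a sketch. The 0.4-4 thresholds are the standard reference range; the log-scale noise screen and the 12.5mcg step are my own reading of the guidelines, not anything official:

```python
import math

def respond_to_tsh(tsh, dose_mcg, weeks_since_change, expected_tsh):
    if weeks_since_change < 5:
        return dose_mcg, "prior change not yet equilibrated; wait and retest"
    if abs(math.log(tsh / expected_tsh)) > math.log(2):
        return dose_mcg, "far from expected on a log scale; suspect noise, retest"
    if tsh > 4.0:
        return dose_mcg + 12.5, "small guideline-sized increase; retest in 4-6 weeks"
    if tsh < 0.4:
        return dose_mcg - 12.5, "small guideline-sized decrease; retest in 4-6 weeks"
    return dose_mcg, "in range; hold"

# The episode above: measured .05 vs. an expected .15 -> retest, don't slash.
print(respond_to_tsh(0.05, 200, weeks_since_change=8, expected_tsh=0.15))
```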

Thyroid Dynamics: Dose Management Challenges

In my last two posts about thyroid dynamics, I described two key features of the information environment that set up a perfect storm for dose management:

  1. The primary indicator of the system state for a hypothyroid patient is TSH, which has a nonlinear (exponential) response to T3 and T4. This means you need to think about TSH on a log scale, but test results are normally presented on a linear scale. Information about the distribution is hard to come by. (I didn’t mention it before, but there’s also an element of the Titanic steering problem, because TSH moves in a direction opposite the dose and T3/T4.)
  2. Measurements of TSH are subject to a rather extreme mix of measurement error and driving noise (probably mostly the latter). Test results are generally presented without any indication of uncertainty, and doctors generally have very few data points to work with.

As if that weren’t enough, the physics of the system is tricky. A change in dose is reflected in T4 and T3, then in TSH, only after a delay. This is a classic “delayed negative feedback loop” situation, much like the EPO-anemia management challenge in the excellent work by Jim Rogers, Ed Gallaher & David Dingli.

If you have a model, like Rogers et al. do, you can make fairly rapid adjustments with confidence. If you don’t, you need to approach the problem like an unfamiliar shower: make small, slow adjustments. If you react too quickly, you’ll excite oscillations. Dose titration guidelines typically reflect this:

Titrate dosage by 12.5 to 25 mcg increments every 4 to 6 weeks, as needed until the patient is euthyroid.

Just how long should you wait before making a move? That’s actually a little hard to work out from the literature. I asked OpenEvidence about this, and the response was typically vague:

The expected time delay between adjusting the thyroid replacement dose and the response of thyroid-stimulating hormone (TSH) is typically around 4 to 6 weeks. This is based on the half-life of levothyroxine (LT4), which reaches steady-state levels by then, and serum TSH, which reaches its nadir at the same time.[1]

The first citation is the ATA guidelines, but when you consult the details, there’s no cited basis for the 4-6 weeks. Presumably this is some kind of 3-tau rule of thumb learned from experience. As an alternative, I tested a dose change in the Eisenberg et al. model:

At the arrow, I double the synthetic T4 dose on a hypothetical person, then observe the TSH trajectory. Normally, you could then estimate the time constant directly from the chart: 63% of the adjustment is realized at 1*tau, 86% at 2*tau, 95% at 3*tau. If you do that here, tau is about 8 days. But not so fast! TSH responds exponentially, so you need to look at this on a log-y scale:


Looking at this correctly, tau is somewhat longer: about 12-13 days. This is still potentially tricky, because the Eisenberg model is not first order. However, it’s reassuring that I get similar time constants when I estimate my own low-order metamodel.

Taking this result at face value, one could roughly say that TSH is 95% equilibrated to a dose change after about 5 weeks, which corresponds pretty well with the ATA guidelines.
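Here’s a synthetic check on the log-scale point (toy numbers of my own, not the Eisenberg model): the underlying state equilibrates first-order with tau = 12 days, TSH responds exponentially to it, and the naive linear-scale reading comes out far too fast:

```python
import numpy as np

tau = 12.0                            # days, the true time constant
t = np.arange(0.0, 91.0)
t4 = 2.0 - np.exp(-t / tau)           # first-order rise after a dose step
tsh = 10.0 ** (-t4)                   # log-linear TSH response to T4

# Naive linear-scale reading: when has TSH covered 63.2% of its total change?
target = tsh[0] - 0.632 * (tsh[0] - tsh[-1])
print("apparent tau:", t[np.argmax(tsh <= target)])    # ~6 days, too fast

# Log-scale reading recovers the truth: the log-TSH gap decays with tau.
gap = np.log10(tsh) + 2.0             # distance from the final level, log units
print("log-scale tau:", -1.0 / np.polyfit(t, np.log(gap), 1)[0])   # ~12 days
```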


This is a long setup for … the big mistake. Referring to the lettered episodes on the chart above, here’s what happened.

  • A: Dose is constant at about 200mcg (a little hard to be sure, because it was a mix of 2 products, and the equivalents aren’t well established).
  • B: New doctor orders a test, which comes out very low (.05), out of the recommended range. Given the long-term dose-response relationship, we’d expect about .15 at this dose, so it seems likely that this was a confluence of dose-related factors and noise.
  • C: New doc orders an immediate drastic reduction of dose by 37.5% or 75mcg (3 to 6 times the ATA recommended adjustment).
  • D: Day 14 from dose change, retest is still low (.2). At this point you’d expect that TSH is at most 2/3 equilibrated to the new dose. Over extremely vociferous objections, doc orders another 30% reduction to 88mcg.
  • E: Patient feeling bad, experiencing hair loss and other symptoms. Goes off the reservation and uses remaining 125mcg pills. Coincident test is in range, though one would not expect it to remain so, because the dose changes are not equilibrated.
  • F: Suffering a variety of hypothyroid symptoms at the lower dose.
  • G: Retest after an appropriate long interval is far out of range on the high side (TSH near 7). Doc unresponsive.
  • H: Fired the doc. New doc restores dose to 125mcg immediately.
  • I: After an appropriate interval, retest puts TSH at 3.4, on the high side of the ATA range and above the NACB guideline. Doc adjusts to 175mcg, in part considering symptoms rather than test results.

This is an absolutely classic case of overshooting a goal in a delayed negative feedback system. There are really two problems here: failure to anticipate the delay, and therefore making a second adjustment before the first was stabilized, and making overly aggressive changes, much larger than guidelines recommend.
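A toy projection shows how badly that combination plays out. The dose-response curve and the 2-week tau are my illustrative stand-ins, roughly consistent with the numbers above:

```python
import math

def tsh_eq(dose):                      # toy log-linear dose-response
    return 2.0 ** ((175.0 - dose) / 25.0)

def project(weekly_doses, tsh=0.5, tau=2.0):
    """First-order weekly TSH response to a dose schedule."""
    for dose in weekly_doses:
        tsh = tsh_eq(dose) + (tsh - tsh_eq(dose)) * math.exp(-1.0 / tau)
        yield round(tsh, 2)

aggressive = [125] * 2 + [88] * 10     # two big cuts, two weeks apart
gentle = [187.5] * 12                  # one guideline-sized 12.5 mcg cut
print(list(project(aggressive)))       # blows through 4, heads toward ~11
print(list(project(gentle)))           # eases from 0.5 toward ~0.7, in range
```

In the aggressive schedule, the second cut lands while TSH is still rising toward its response to the first, so the trajectory sails far past the top of the range, much like episode G.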

So, what’s really going on? I’ve been working with a simplified meta version of the Eisenberg model to figure this out. (The full model is hourly, and therefore impractical to run with Kalman filtering over multi-year horizons. It’s silly to use that much computation on a dozen data points.)

The problem is, the model can’t replicate the data without invoking huge driving noise – there simply isn’t anything in the structure that can account for data points far from the median behavior. I’ve highlighted a few above. At each of these points, the model takes a huge jump, not because of any known dynamics, but because of a filter reset of the model state. This is a strong hint that there’s an unobserved state influencing the system.

If we could get docs to provide a retest at these outlier points, we could at least rule out measurement error, but that has almost never happened. Also, if docs would routinely order a full panel including T3 and T4, not just TSH, we might have a better mechanistic explanation, but that has also been hard to get. Recently, a doc ordered a full panel, but office staff unilaterally reduced the scope to TSH only, because they felt that testing T3 and T4 was “unconventional”. No doubt this is because ATA and some authors have been shouting that TSH is the only metric needed, and any nuances that arise when the evidence contradicts that view get lost.

For our N=1, the instability of the TSH/T4 relationship contradicts the conventional wisdom, which is that individuals have a stable set point, with the observed high population variation arising from diversity of set points across individuals:

I think the obvious explanation in our N=1 is that some individuals have an unstable set point. You could visualize that in the figure above as moving from one intersection of curves to another. This could arise from a change in the T4->TSH curve (e.g. something upstream of TSH in the hypothalamic-pituitary-thyroid axis) or the TSH->T4 relationship (intermittent secretion or conversion). Unfortunately very few treatment guidelines recognize this possibility.