Misperceptions of Nonlinearity

I’ve been working on thyroid dynamics, tracking a friend’s data and seeking some understanding with models. With only one patient’s time series to go on, it’s impossible to generalize about thyroid behavior (though there are some interesting features I’ll report on later). On the other hand, our sample size of doctors is now around 10, and I’m starting to see some persistent misperceptions leading to potentially dangerous errors. One of the big issues is simple.

The principal indicator used for thyroid diagnosis is TSH (thyroid stimulating hormone). TSH regulates production of T4 and T3, T3 being the metabolically active hormone. T4 and T3 in turn downregulate TSH (via TRH), producing a negative feedback loop. The basis for the target range for TSH in thyroid treatment is basically the distribution of TSH in the general population without a thyroid diagnosis.

The challenge with TSH is that its response is logarithmic, so its distribution is lognormal. The usual target range is 0.4 to 4 mIU/L (or 0.45 to 4.5, or something else, depending on which source you prefer). Anyway, suppose you test at 2.2 – bingo! right in the middle! Well, not so fast. The geometric mean of 0.4 and 4 is actually about 1.26, so you’re a little high.

How high? Well, no one will tell you without a fight. For some reason, most sources insist on throwing out much of the relevant information about the distribution of “normal”. In fact, when you look at the large survey papers reporting on population health, like NHANES, it’s hard to find the distribution. For example, Thyroid Profile of the Reference United States Population: Data from NHANES 2007-2012 (Jain 2015) doesn’t have a single visualization of the data – just a bunch of tables. When you do find the distribution, you’ll often get a subset (smokers) or a linear-scaled version that makes it hard to see the left tail. (For the record, TSH=2.2 is just above the 75th percentile in the NHANES THYROD_G dataset – so already quite far from the middle.)
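The midpoint arithmetic for the reference range is easy to check. Here is a minimal sketch, assuming the 0.4 to 4 mIU/L limits quoted above; the function names are my own illustrations, not from any clinical library.

```python
import math

def geometric_mean(lo, hi):
    """Midpoint of a range on a log scale."""
    return math.sqrt(lo * hi)

def log_position(x, lo, hi):
    """Fraction of the way from lo to hi on a log scale (0 = lo, 1 = hi)."""
    return (math.log(x) - math.log(lo)) / (math.log(hi) - math.log(lo))

LO, HI = 0.4, 4.0   # assumed reference range, mIU/L
tsh = 2.2           # the example test result from the text

arithmetic_mid = (LO + HI) / 2
geometric_mid = geometric_mean(LO, HI)
position = log_position(tsh, LO, HI)

print(f"arithmetic midpoint: {arithmetic_mid:.2f}")
print(f"geometric midpoint:  {geometric_mid:.2f}")
print(f"TSH {tsh} sits {position:.0%} of the way up the range on a log scale")
```

On a log scale, 2.2 is about three quarters of the way from the lower limit to the upper limit, not halfway. That is a statement about position within the range, which is not the same thing as a population percentile.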

There are also more subtle issues. In the first NHANES thyroid survey article, I found Fig. 1:

Here we have a log scale, but bins of convenience. The breakpoints 1, 2, 3, 5, 10… happen to be roughly equally spaced on a log scale. But when you space histogram bins at these intervals, the bin width is very rough indeed – varying by almost a factor of 2. That means the shape of the distribution is distorted. One of the bins expected to be small is the 2.1-3 range, and you can actually see here that those columns look anomalously low, compared to what you’d expect of a nice bell curve.
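The unevenness of those bins is quick to quantify. This sketch uses only the breakpoints named above (1, 2, 3, 5, 10) and measures each bin's width in log10 units; it says nothing about the actual counts in the figure.

```python
import math

breakpoints = [1, 2, 3, 5, 10]  # bin edges from the text, mIU/L

# Width of each bin in decades (log10 units). Equal widths would leave
# the shape of a lognormal histogram undistorted.
edges = list(zip(breakpoints, breakpoints[1:]))
log_widths = [math.log10(b) - math.log10(a) for a, b in edges]

for (a, b), w in zip(edges, log_widths):
    print(f"{a:>2} to {b:<2}: {w:.3f} decades")

ratio = max(log_widths) / min(log_widths)
narrowest = edges[log_widths.index(min(log_widths))]
print(f"widest / narrowest = {ratio:.2f}, narrowest bin = {narrowest}")
```

The 2 to 3 bin is the narrowest on a log scale, so it collects the fewest observations for a given density. A fairer plot would either use equal log-width bins or divide each count by its bin's log width.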

That was 20 years ago, but things are no better with modern analytics. If you get a blood test now, your results are likely to be reported to you, and your doc, through a system like MyChart. For TSH, this means you’ll get a range plot with a linear scale:

backed up by a time series with a linear scale:

Notice that the “normal” range doesn’t match the ATA recommendation or any other source I’ve seen. Presumably it’s the lab’s +/- 2 standard deviation range or something like that. That’s bad, because the upper limit – 4.82 – is above every clinical association recommendation I’ve seen. The linear scale squashes all the variation around low values, and exaggerates the high ones.

Given that the information systems for the entire thyroid management enterprise offer biased, low-information displays of TSH stats, I think it’s not surprising that physicians have trouble overcoming the nonlinearity bias. They have hundreds of variables to think about, and can’t possibly be expected to maintain a table of log or Z transformations in their heads. Patients are probably even more baffled by the asymmetry.

It would be simple to remedy this by presenting the information in a way that minimizes the cognitive burden on the viewer. Reporting TSH on a log scale is trivial. Reporting percentiles as a complementary statistic would also be trivial, and percentiles are widely used, so you don’t have to explain them. Setting a target value rather than a target range would encourage driving without bouncing off the guardrails. I hope authors and providers can figure this out.
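As a rough illustration of how cheap a percentile report would be, here is a sketch that converts a TSH value to a percentile. It rests on a crude assumption that is mine, not any lab's: that TSH is exactly lognormal and that 0.4 to 4 mIU/L brackets the central 95% of the population. A real system would use the empirical distribution (for example from NHANES), so the numbers here will differ from survey percentiles.

```python
import math
from statistics import NormalDist

LO, HI = 0.4, 4.0  # ASSUMED to be the 2.5th and 97.5th percentiles

# Fit a normal distribution to ln(TSH) from the two assumed percentiles.
z975 = NormalDist().inv_cdf(0.975)
mu = (math.log(LO) + math.log(HI)) / 2
sigma = (math.log(HI) - math.log(LO)) / (2 * z975)
log_tsh = NormalDist(mu, sigma)

def tsh_percentile(tsh):
    """Percentile of a TSH value under the assumed lognormal distribution."""
    return 100 * log_tsh.cdf(math.log(tsh))

for value in (0.4, 1.0, 2.2, 4.0):
    print(f"TSH {value:>4}: percentile {tsh_percentile(value):5.1f}")
```

Under these assumptions a reading of 2.2 lands in the low 80s, well away from the median, which sits at the geometric mean of the range.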

1975 – the CIA Evaluates SD

I just had a great time at the MIT SD Group for a Friday seminar. Lots to think about! Hopefully I can report on a few topics later.

In the meantime, Hesam Mahmoudi showed me a fun tidbit (via Navid Ghaffarzadegan). It’s a declassified CIA evaluation of MIT SD developments circa 1975, which one contributor refers to as the “Forrester cult.” I can’t believe I haven’t seen this before.

The first report is an interesting read, but mainly for its naïvely arrogant clever snark. The author appears to have completely missed the point.

I’d be interested to know what models specifically are “the same kinds” as industrial dynamics. AFAIK, economics was pretty solidly entrenched in econometrics at the time, and that had little to do with dynamics. Dynamic models were not unknown, including the Ramsey model (solved analytically), the Samuelson multiplier-accelerator (oops), and the hydraulic Phillips machine, but they were hardly mainstream.

Well, actually, we mostly use differential equations, because discrete time stinks. But that’s a minor point.

“Snake diagram” … oil … get it? Snake oil?

The author implements this as:

This doesn’t actually have much to do with SD. No one would formulate a market clearing mechanism this way, because the discrete time implementation has obvious flaws, including conflating the time step with the time scale of the price adjustment process. The initial condition for price is also omitted.

The “striking similarity between this model and good old supply-and-demand analysis” is clearly referencing the familiar plot of the intersection of supply and demand curves, which is generally about consumer surplus, taxation, technical shifts, etc. – nothing to do with dynamics. Instead, this is the cobweb model:

Obviously the dynamics lie on the supply and demand curves, but except for a trivial equilibrium point, it’s an oscillator, damped over half its parameter space, but explosive in the other half. This is basically a big exercise in DT error. The degree of damping depends on the relative slopes of the supply and demand curves, which is problematic because we don’t necessarily expect oscillatory behavior from models with stiff elasticities in the real world. The discrete time specification neglects the time constant of the adjustment process; slow adjustment is not the same as low elasticity, but the two are conflated here. This is actually a common problem in econometric models and might partly explain why short term and long term elasticity estimates overlap.
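To see the damped and explosive halves of the parameter space, here is a minimal sketch. The CIA author's actual equations are not reproduced here, so this is the textbook linear cobweb with made-up coefficients, followed by a hypothetical alternative of my own in which price adjusts with an explicit time constant `tau`.

```python
def equilibrium(b, d, a=10.0, c=2.0):
    """Price where demand a - b*P equals supply c + d*P."""
    return (a - c) / (b + d)

def cobweb(b, d, p0, steps=20, a=10.0, c=2.0):
    """Discrete cobweb: demand a - b*P[t] clears supply c + d*P[t-1]."""
    prices = [p0]
    for _ in range(steps):
        prices.append((a - c - d * prices[-1]) / b)
    return prices

def gradual_adjustment(b, d, p0, tau=4.0, dt=0.125, t_end=40.0,
                       a=10.0, c=2.0):
    """Price closes the excess-demand gap with time constant tau (Euler).

    Dividing by (b + d) is an illustrative normalization so that tau,
    not the slopes, sets the speed of adjustment.
    """
    p, prices = p0, [p0]
    for _ in range(int(t_end / dt)):
        excess_demand = (a - b * p) - (c + d * p)
        p += dt * excess_demand / ((b + d) * tau)
        prices.append(p)
    return prices

p_star = equilibrium(1.0, 0.5)             # same for both slope pairs below
damped = cobweb(b=1.0, d=0.5, p0=6.0)      # supply flatter than demand
explosive = cobweb(b=0.5, d=1.0, p0=6.0)   # supply steeper than demand
smooth = gradual_adjustment(b=0.5, d=1.0, p0=6.0)

print("equilibrium price:", round(p_star, 3))
print("damped cobweb:    ", [round(p, 2) for p in damped[:6]])
print("explosive cobweb: ", [round(p, 2) for p in explosive[:6]])
print("gradual, final:   ", round(smooth[-1], 3))
```

In the cobweb, each step multiplies the deviation from equilibrium by -d/b, so the slope ratio alone decides between damping and explosion, and the explosive case quickly wanders into negative prices. With an explicit adjustment time, the same slopes give a smooth approach. A fuller model with supply delays could still oscillate, but for reasons tied to real time constants rather than to the time step.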

Finally, we get the old red herring, that SD models are over-parameterized:

This is just silly, and also deeply ironic because omitted structure in the author’s proposed model (presumably to avoid having an explicit time constant) would seriously bias parameter estimates. It also embodies the common but wrong view that estimates from the particular data in an analysis are the only information in the universe that can inform a model.

It’s too bad the first author was too busy being dismissive to develop a proper critique, because we all might have learned something from that. Interestingly, though, a second author in the file came away with a different conclusion:

Correlation & causation – it’s complicated

It’s common to hear that correlation does not imply causation. It’s certainly true in the strong sense that observing a correlation between X and Y does not prove causation X->Y, because the true causality might be Y->X or Z->(X,Y).

Some wag (Feynman?) pointed out that “correlation does not imply causation, but it’s a good start.” This is also true. If you’re trying to understand the causes of Y, data mining for things that are correlated is a useful exploratory step, even if it proves nothing. If you find something, then you can look for plausible mechanisms, try experiments, etc.

Some go a little further than this, and combine Popper’s falsification with causality criteria to argue that lack of correlation does imply lack of causation. Unfortunately, this is untrue, for a number of reasons:

  1. Measurement error – in OLS regression, the slope is just the correlation coefficient normalized by standard deviations. However, if there’s measurement error in the RHS variables, not just equation error affecting the LHS, the slope is affected by attenuation bias. In other words, a poor signal to noise ratio destroys apparent correlation, even when causality is present.
  2. Integration – bathtub dynamics renders pattern matching incorrect, and destroys correlations, even in synthetic data experiments where causation is known to exist.
  3. Nonlinearity – there are many possible bivariate patterns that result in a linear correlation coefficient of 0 despite an obvious (possibly causal) relationship.

Most systems have all three of these features to some extent, and they gain strength in combination. Noise integrates into the system stocks, and the slope or correlation of a relationship may reverse, depending on system state. Sugihara et al. show that Granger Causality fails, because “in deterministic dynamic systems (even noisy ones), if X is a cause for Y, information about X will be redundantly present in Y itself and cannot formally be removed….”
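The three mechanisms in the list above are easy to reproduce with synthetic data where causation is known by construction. This is a sketch with arbitrary parameters of my choosing (noise levels, sample size, a sine-wave inflow), using only the standard library.

```python
import math
import random

def _moments(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson(xs, ys):
    """Linear correlation coefficient."""
    sxy, sxx, syy = _moments(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(xs, ys):
    """Slope of an ordinary least squares fit of ys on xs."""
    sxy, sxx, _ = _moments(xs, ys)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the run is repeatable

# 1. Measurement error: y truly depends on x with slope 2, but x is
#    observed with noise twice as large as its own spread.
n = 5000
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * x + rng.gauss(0, 0.1) for x in x_true]
x_seen = [x + rng.gauss(0, 2) for x in x_true]
slope_clean, slope_noisy = ols_slope(x_true, y), ols_slope(x_seen, y)
r_clean, r_noisy = pearson(x_true, y), pearson(x_seen, y)

# 2. Integration: the stock is caused entirely by the inflow, yet over
#    whole cycles the two are a quarter cycle out of phase.
steps, cycles = 400, 2
dt = 2 * math.pi * cycles / steps
inflow = [math.sin(k * dt) for k in range(steps)]
stock, level = [], 0.0
for f in inflow:
    stock.append(level)
    level += f * dt
r_integration = pearson(inflow, stock)

# 3. Nonlinearity: y = x^2 on a symmetric range.
xs = [k / 100 for k in range(-100, 101)]
r_parabola = pearson(xs, [x * x for x in xs])

print(f"measurement error: slope {slope_clean:.2f} -> {slope_noisy:.2f}, "
      f"r {r_clean:.2f} -> {r_noisy:.2f}")
print(f"integration:  r = {r_integration:.3f}")
print(f"nonlinearity: r = {r_parabola:.3f}")
```

In all three cases the causal link is exact by construction, yet the estimated slope is badly attenuated or the correlation is essentially zero.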

The common thread here is that no method can say much about causality if the assumptions neglect features of the system dynamics (integration or nonlinearity) or stochastic processes (measurement error and driving noise). Sometimes you get lucky, because you have a natural experiment, or high precision measurements, or simply loads of data about benign dynamics, but luck rarely coincides with big novel problems. Presence or absence of correlation is suggestive but far from definitive.

The Way Out

I’ve previously advised that a hyper-vigilant emphasis on model quality is the only viable path to a good model approaching the scope you hope for. That’s the green path.

However, it’s likely you will be captured by the red path at times. Client appetite for scope, desire for rapid perceived progress, the apparent ease of adding features, and existing models that aren’t as good as the box they came in advertised are all seductive.

Once you’ve overextended on scope, the standard vicious cycles in project models kick in:

  • errors beget errors
  • large models are slow and hard to test
  • errors mask other errors, making rework discovery more difficult
  • time pressure -> fatigue -> morale -> errors

So how do you get back on the righteous path? I think there are three options, but only 1.5 work.

The red path is tempting, because you can preserve the illusion of progress on scope. It will seldom work though. You’re unlikely to be able to inspect quality into an enlarged model later. You might progress to the right, but you won’t progress up. The orange path, a compromise of mild simplification and aggressive improvement, might work, but it’s going to hurt.

You’re better off to pursue the blue path, which essentially means reconstructing a better, simpler model, even at the expense of perceived functionality. Step 1: you’re in a hole, so stop digging. Productive things you might do include:

  • Suspend feature enhancements
  • Do extreme conditions tests – LOTS of them
  • Calibrate to data or run policy optimizations, as another kind of test (the algorithm will exploit weaknesses)
  • Dismantle sectors into standalone submodels that are easier to test and redesign
  • Aggressively clean up diagrams
  • Have a zero-tolerance policy for unit errors and runtime warnings
  • Document equations
  • Conduct team design reviews
  • Keep a trail of your model versions and components, so you can backtrack and later restore things you have to rip out

Once you’re back in shape, continue these disciplines. They’ll keep you on a path that stays far from the vortex.

Held v Montana

The Montana climate case, Held vs. State of Montana, has just turned in a win for youth.

The decision looks pretty strong. I think the bottom line is that the legislature’s MEPA exclusions preventing consideration of climate in state regulation are a limitation of the MT constitutional environmental rights, and therefore require strict scrutiny. The state failed to show that the MEPA Limitation serves a compelling government interest.

Not to diminish the accomplishments of the plaintiffs, but the state put forth a very weak case. The Montana Supreme Court tossed out AG Knudsen’s untimely efforts to send the case back to the drawing board. The state’s own attorney, Thane Johnson, couldn’t get acronyms right for the IPCC and RCPs. That’s perhaps not surprising, given that the Director of Montana’s alleged environmental agency admitted unfamiliarity with the largest scientific body related to climate:

Montana’s top witnesses — state employees who are responsible for permitting fossil fuel projects — however, acknowledged they are not well-versed in climate science and at times struggled with the many acronyms used in the case.

Chris Dorrington, director of the Montana Department of Environmental Quality, told an attorney for the youth that he had been unaware of the U.N. Intergovernmental Panel on Climate Change (IPCC) — which has issued increasingly dire assessments since it was established more than 30 years ago to synthesize global climate data.

“I attended this trial last week, when there was testimony relevant to IPCC,” Dorrington said. “Prior to that, I wasn’t familiar, and certainly not deeply familiar with its role or its work.”

As noted by Judge Seeley, the state left much of the plaintiffs’ evidence uncontested. They also declined to call their star witness on climate science, Judith Curry, who reflects:

MT’s lawyers were totally unprepared for direct and cross examination of climate science witnesses. This was not surprising, since this is a very complex issue that they apparently had not previously encountered. One lawyer who was cross-examining the Plaintiffs’ witnesses kept getting confused by ICP (IPCC) and RPC (RCP). The Plaintiffs were very enthusiastic about keeping witnesses in reserve to rebut my testimony, with several of the Plaintiffs’ witnesses who were leaving on travel presenting pre-buttals to my anticipated testimony during their direct questioning – all of this totally misrepresented what was in my written testimony, and can now be deleted from the court record since I didn’t testify. I can see that all of this would have turned the Hearing into a 3-ring climate circus, and at the end of all that I might not have managed to get my important points across, since I am only allowed to respond to questions.

On Thurs eve, I received a call from the lead Montana lawyer telling me that they were “letting me off the hook.” I was relieved to be able to stay home and recapture those 4 days I had scheduled for travel to and from MT.

The state’s team sounds pretty dysfunctional:

Montana’s approach to the case has evolved since 2020, has evolved rapidly in the last 6 months since a new legal team was brought in, and even evolved rapidly during the course of the trial.  The lawyers I spoke to in Sept 2022 were gone by the end of Oct, with an interim team brought in from the private sector, and then a new team that was hired for the Montana’s State Attorney’s Office in Dec.

MT’s original expert witnesses were apparently tossed, and I and several other expert witnesses were brought on board in the 11th hour, around Sept 2022. Note:  instructions for preparing our written reports were received from lawyers two generations removed from the actual trial lawyers.  As per questioning during my Deposition, I gleaned that the state originally had a collection of witnesses that were pretty subpar (I don’t know who they were).  The new set of witnesses was apparently much better.

If the state has such a compelling case, why can’t they get their act together?

In any case, I find one argument in all of this really disturbing. Suppose we accept Curry’s math:

With regards to Montana’s CO2 emissions, based on 2019 estimates Montana produces 0.63% of U.S. emissions and 0.09% of global emissions.  For an anticipated warming of 2°C, Montana’s 0.09% of emissions would account for 0.0018°C of warming.  There are other ways to frame this calculation (and more recent numbers), but any way you slice it, you can’t come up with a significant amount of global warming that is caused by Montana’s emissions.

Never mind that MT is also only .0135% of global population. If you get granular enough, every region is a tiny fraction of the world in all things. So if we are to imagine that “my contribution is small” equates to “I don’t have to do anything about the problem,” then no one has to do anything about climate, or any other global problem for that matter. There’s no role for leadership, cooperation or enlightened self-interest. This is a circular firing squad for global civilization.

What is accumulation?

The SD Society posted a definition of accumulation on Facebook, and it caught my eye.

This is from the SD Glossary, by David Ford.

accumulation (integration) : a gradual, non-instantaneous increase or decrease of a quantity over time. An accumulator is also referred to as a stock or level and represents the state of a system. To accumulate is the act of increasing and decreasing the size of a state variable (a stock) over time.

I wrote,

I’m not a fan of this definition. Accumulation is not necessarily gradual or non-instantaneous. In fact, it’s quite common to accumulate a flow pulse to produce an abrupt step in a stock. The key feature of accumulation is that it’s, well, cumulative. I’m at a loss for a way to express that without mentioning integration, which won’t help most people. Maybe someone can do better?

I think it’s telling that we don’t have ready words to describe accumulation. That might be a symptom, or a cause, of our problematic mental models about bathtub dynamics and bathtub statistics.

Resorting to “integration” isn’t really helpful, except to the mathematically inclined, which is not the audience for this kind of description I think.

The dictionary definition of “cumulative” turns out to be helpful:

increasing by successive additions

With that in mind, I’d propose something like:

  • accumulation : increasing by successive additions, or decreasing by successive subtractions.
  • stock (level) : A variable representing a persistent state in a system, which can be considered the memory of the system. Stocks change by accumulation of flows.
  • flow (rate): A variable that contributes to cumulative change in a stock over time. Flows represent activity or change in a system. A flow may represent the movement of physical quantities between stocks within a system boundary or across the model boundary and thereby into or out of the system (sinks and sources), or the rate of change of a nonphysical or intangible state.

Note that it’s hard to discuss accumulation without also discussing stocks and flows, so I’ve modified all three glossary entries.
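The point that accumulation need not be gradual can be shown in a few lines. This is a minimal sketch using simple Euler accumulation, with a time step, pulse size, and flow values chosen arbitrarily for illustration.

```python
def accumulate(flows, dt, initial=0.0):
    """Stock trajectory built by successive additions of flow * dt."""
    stock = [initial]
    for f in flows:
        stock.append(stock[-1] + f * dt)
    return stock

dt = 0.25          # time step (arbitrary units)
steps = 40
pulse_step = 20    # step at which the pulse arrives
pulse_size = 5.0   # total quantity delivered by the pulse

# A steady trickle produces a gradual ramp in the stock.
trickle = [0.5] * steps
# A pulse delivers the same total within a single time step.
pulse = [pulse_size / dt if k == pulse_step else 0.0 for k in range(steps)]

ramp = accumulate(trickle, dt)
step = accumulate(pulse, dt)

print("ramp:", ramp[::8])
print("step:", step[::8])
```

Both stocks end at the same level, and both got there by successive additions. One path is gradual and the other is an abrupt step, which is why "gradual" does not belong in the definition.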

What is SD?

Asmeret Naugle, Saeed Langarudi, and Timothy Clancy propose to define System Dynamics in a new paper.

The defining characteristics are: (1) models are based on causal feedback structure, (2) accumulations and delays are foundational, (3) models are equation-based, (4) concept of time is continuous, and (5) analysis focuses on feedback dynamics.

I like the paper, but … not so fast. I think more, and more flexible, criteria are needed. I would use the term “characterize” rather than “define.” The purpose should be to aid recognition of SD, and hopefully good SD, without drawing too tight a box around the field.

I particularly disagree with the inclusion of continuous time. Even though discrete time stinks, I think continuous time is a common but inessential feature, like continuous flows. Many models include occasional discrete events, and sometimes they’re important. Ventity’s actions are explicit discrete events between time steps, and they may modify model structure in ways that are key to an operational representation of reality.

My top-of-mind alternative framework looks like:

I think it’s also helpful to describe things that are not SD:

  • Intertemporal optimization or rational expectations representing behavior
  • Computable general equilibrium
  • Linear regression
  • Linear programming
  • Mixed integer programming
  • Social Network Analysis (static)
  • Discrete ABM
  • Discrete event simulation
  • Equilibrium
  • Simultaneity

Sometimes it’s easier to see the negative space, but there are exceptions to these rules.

I think it’s notable that both frameworks exclude a variety of qualitative systems thinking approaches, like group model building or elicitation methods that create CLDs rather than simulatable models. I’m a big tent fan, and certainly some of the exceptions are common at the SD conference, but does that make them SD?

I think behavior is another challenging feature to describe. In my mind, System Dynamics is almost synonymous with behavioral dynamics. If you’re building an economic model in which agents explicitly know the future (e.g., via intertemporal optimization), it’s not an SD model (though you might be using it as a comparison case for some SD purpose). Yet there’s a strong tradition of prize-winning biomedical models that lack behavior because they lack human agency. These are not easily distinguishable from what other fields might call ODEs or nonlinear dynamics. I would not want to eject those from the field, but neither would I want this to become our focus.

I’ll be interested to see how the conversation evolves on this.

Mental Models vs. Models in the Loop

Timothy Clancy, Saeed P. Langarudi and Raafat Zaini have an interesting new commentary in the SDR.

Never the strongest: reconciling the four schools of thought in system dynamics in the debate on quality

With the passing of Jay Forrester, the field of system dynamics exists at a similar crossroads. Debates of implicit, if not explicit, inheritance and future direction are already breaking out among competing generals. Who owns Forrester’s legacy? Will we proceed down the reference mode of the Macedonian and Mughal Empires—or will we instead seek an alternative reference mode of Alexandria: integration, reconciliation, and mutually recognized coexistence of different schools within the broader field of system dynamics?

We suggest the latter path—and that begins by recognizing at least four, if not more, distinct schools of thought on how to approach system dynamics and the study of complex systems. We believe these schools arise from differing mental models in the field and the consequences that arise in practice from these differences.

I haven’t really absorbed it yet, so I’ll refrain from direct comment, but it did spur me to finish off a draft of some similar thoughts on these questions.

I personally lean very much toward the hard science, data-driven side of the field: what the authors call the Empirical school of thought. But as a policy, I lean toward a big tent view of the field that includes work with low model content (which I don’t equate with low quality).

I think the central tension in the debate has already been posed by JWF and others long ago – all the way back to Industrial Dynamics really. In Some Basic Concepts in System Dynamics (2009), Forrester summarized,

The basic feedback loop in Figure 4 is too simple to represent real-world situations. But simple loops have more serious shortcomings—they are misleading and teach the wrong lessons. Most of our intuitive learning comes from very simple systems. The truths learned from simple systems are often completely opposite from the behavior of more complex systems. A person understands filling a water glass, as in Figure 3. But, if we go to a system that is only five times as complicated, as in Figure 5, intuition fails. A person cannot look at Figure 5 and anticipate the behavior of the pictured system.

Figure 5 from World Dynamics is five times more complicated than Figure 4 in the sense that it has five stocks—the rectangles in the figure. The figure shows how rapidly apparent complexity increases as more system stocks are added.

Mathematicians would describe Figure 5 as a fifth-order, nonlinear, dynamic system. No one can predict the behavior by studying the diagram or its underlying equations. Only by using computer simulation can the implied behavior be revealed.

I think the message is pretty clear here. To solve complex problems, you must formally simulate the system because mental simulations are treacherous. I’d go even one step further, and argue that it’s not sufficient to simulate the system once, figure out where the leverage point is, implement the solution, and toss out the model when you’re done. The simulation needs to become an ongoing part of the loop for model predictive control.

If that’s the ideal, why settle for anything less? I think there are a number of possible answers.

  1. Even in a perfect world where it’s easy to construct the model-in-the-loop, you need buy-in from the participants in the system to implement the model, and that requires a skillset that’s quite distinct.
  2. While it’s true that no one can intuit the behavior of a 10th order system, there might be a lot of value in managing low-order components of the system that are amenable to mental simulation or simple decision rules. There might be two reasons for this:
    • The complex system is dominated by a few key parameters (as in sloppy systems).
    • Risking global suboptimization by improving locally is better than optimizing nothing (though this might be a matter of luck).
  3. Often no single stakeholder in the system has the resources or authority to implement needed changes. But exposing the connectivity of the system, even if you can’t predict exactly how it works, is sometimes enough to catalyze creation of higher-level structures that enable change in the future.
  4. Not everyone is, or wants to be, a modeler. Moreover some participants in the system may reject models, data, and pretty much everything else since the Enlightenment, but you still have to include them.
  5. The non-modelers, as participants in the system, hold key knowledge that the modelers need.
  6. A qualitative map of a system is a good start towards an eventual quantitative model.
  7. Not every problem is big enough to model.

I’m sure you can probably think of more. I think these are good reasons to embrace non-model-based work on systems, as long as one refrains from making strong predictions about behavior from incomplete descriptions of behavior. Fortunately that leaves a lot of interesting things to think about.

I think the opposite perspective, that nothing is worth doing without a model and data, requires some counterexamples. Are there instances in which a group mapping exercise, playing a dynamic game, or engaging in cross-functional dialog led to reduced performance? I’m not aware of good examples of this, and certainly not of good diagnoses of the outcome. Attribution in complex systems is notoriously difficult. I think what this suggests is that we need stronger links to the evaluation research community, because we don’t really know what works and what doesn’t. We already have some strength in this area from the dynamic decision making experiment thread of SD, but … physician heal thyself.

There is one thing that troubles me though, just beyond the boundaries of our field. It’s climate policy (and related global issues). Most climate policy advocates are in some sense systems thinkers. Many build nice diagrams or use other systemic tools. If you don’t care about systems, it’s hard to see why you’d care about climate to begin with.

Yet … it seems that a substantial fraction of people who are pro-climate policy favor policies that are counterproductive or insufficient. They like low-carbon fuel standards that are unstable, inefficient, and can even increase emissions. They like standards that allocate more property rights to bigger polluters, or simply make it harder to change. They like to impose constraints on new fossil supply that work exactly like OPEC to increase prices and profits for incumbent producers. They subsidize EVs and solar, increasing the incentive to consume energy and congest roads, with benefits accruing to the rich who can afford the capital outlay.

What this means is that my reason #2, “Risking global suboptimization by improving locally is better than optimizing nothing,” isn’t working out too well. I think this is exactly the kind of counterintuitive behavior of social systems that JWF was referring to. I don’t believe you can sort these things out with CLDs or other qualitative methods, except perhaps when they are used as explanatory tools for underlying formal models.

I think the bottom line is that, inside the big tent, the tall pole must remain construction and validation of robust behavioral dynamic models.

Feedback is Interdisciplinary

Quite a while ago, I wrote about modeling the STEM workforce:

An integrated model needs three things: what, how, and why. The “what” is the state of the system – stocks of students, workers, teachers, etc. in each part of the system. Typically this is readily available – Census, NSF and AAAS do a good job of curating such data. The “how” is the flows that change the state. There’s not as much data on this, but at least there’s good tracking of graduation rates in various fields, and the flows actually integrate to the stocks. Outside the educational system, it’s tough to understand the matrix of flows among fields and economic sectors, and surprisingly difficult even to get decent measurements of attrition from a single organization’s personnel records. The glaring omission is the “why” – the decision points that govern the aggregate flows. Why do kids drop out of science? What attracts engineers to government service, or the finance sector, or leads them to retire at a given age? I’m sure there are lots of researchers who know a lot about these questions in small spheres, but there’s almost nothing about the “why” questions that’s usable in an integrated model.

I think the current situation is a result of practicality rather than a fundamental philosophical preference for analysis over synthesis. It’s just easier to create, fund and execute standalone micro research than it is to build integrated models.

According to Jay Forrester, Gordon Brown said it much more succinctly:

The message is in the feedback, and the feedback is inherently

AI doesn’t help modelers

Large language model AI doesn’t help with modeling. At least, that’s my experience so far.

DALL-E images from Bing image creator.

On the ACM blog, Bertrand Meyer argues that AI doesn’t help programmers either. I think his reasons are very much compatible with what I found attempting to get ChatGPT to discuss dynamics:

Here is my experience so far. As a programmer, I know where to go to solve a problem. But I am fallible; I would love to have an assistant who keeps me in check, alerting me to pitfalls and correcting me when I err. A effective pair-programmer. But that is not what I get. Instead, I have the equivalent of a cocky graduate student, smart and widely read, also polite and quick to apologize, but thoroughly, invariably, sloppy and unreliable. I have little use for such  supposed help.

He goes on to illustrate by coding a binary search. The conversation is strongly reminiscent of our attempt to get ChatGPT to model jumping through the moon.

And then I stopped.

Not that I had succumbed to the flattery. In fact, I would have no idea where to go next. What use do I have for a sloppy assistant? I can be sloppy just by myself, thanks, and an assistant who is even more sloppy than I is not welcome. The basic quality that I would expect from a supposedly intelligent  assistant—any other is insignificant in comparison —is to be right.

It is also the only quality that the ChatGPT class of automated assistants cannot promise.

I think the fundamental problem is that LLMs aren’t “reasoning” about dynamics per se (though I used the word in my previous posts). What they know is derived from the training corpus, and there’s no reason to think that it reflects a solid understanding of dynamic systems. In fact there are presumably lots of examples in the corpus of failures to reason correctly about dynamic causality, even in the scientific literature.

This is similar to the reason AI image creators hallucinate legs and fingers: they know what the parts look like, but they don’t know how the parts work together to make the whole.

To paraphrase Meyer, LLM AI is the equivalent of a polite, well-read assistant who lacks an appreciation for complex systems, and aggressively indulges in laundry-list, dead-buffalo thinking about all but the simplest problems. I have no use for that until the situation improves (and there’s certainly hope for that). Worse, the tools are very articulate and confident in their clueless pronouncements, which is a deadly mix of attributes.

Related: On scientific understanding with artificial intelligence | Nature Reviews Physics