AI doesn’t help modelers

Large language model AI doesn’t help with modeling. At least, that’s my experience so far.

DALL-E images from Bing image creator.

On the ACM blog, Bertrand Meyer argues that AI doesn’t help programmers either. I think his reasons are very much compatible with what I found attempting to get ChatGPT to discuss dynamics:

Here is my experience so far. As a programmer, I know where to go to solve a problem. But I am fallible; I would love to have an assistant who keeps me in check, alerting me to pitfalls and correcting me when I err. A effective pair-programmer. But that is not what I get. Instead, I have the equivalent of a cocky graduate student, smart and widely read, also polite and quick to apologize, but thoroughly, invariably, sloppy and unreliable. I have little use for such  supposed help.

He goes on to illustrate by coding a binary search. The conversation is strongly reminiscent of our attempt to get ChatGPT to model jumping through the moon.

And then I stopped.

Not that I had succumbed to the flattery. In fact, I would have no idea where to go next. What use do I have for a sloppy assistant? I can be sloppy just by myself, thanks, and an assistant who is even more sloppy than I is not welcome. The basic quality that I would expect from a supposedly intelligent  assistant—any other is insignificant in comparison —is to be right.

It is also the only quality that the ChatGPT class of automated assistants cannot promise.

I think the fundamental problem is that LLMs aren’t “reasoning” about dynamics per se (though I used the word in my previous posts). What they know is derived from the training corpus, and there’s no reason to think that it reflects a solid understanding of dynamic systems. In fact there are presumably lots of examples in the corpus of failures to reason correctly about dynamic causality, even in the scientific literature.

This is similar to the reason AI image creators hallucinate legs and fingers: they know what the parts look like, but they don’t know how the parts work together to make the whole.

To paraphrase Meyer, LLM AI is the equivalent of a polite, well-read assistant who lacks an appreciation for complex systems, and aggressively indulges in laundry-list, dead-buffalo thinking about all but the simplest problems. I have no use for that until the situation improves (and there’s certainly hope for that). Worse, the tools are very articulate and confident in their clueless pronouncements, which is a deadly mix of attributes.

Related: On scientific understanding with artificial intelligence | Nature Reviews Physics

Sources of Information for Modeling

The traditional picture of information sources for modeling is a funnel. For example, in Some Basic Concepts in System Dynamics (2009), Forrester showed:

I think the diagram, or at least the concept, is much older than that.

However, I think the landscape has changed a lot, with more to come. Generally, the mental database hasn’t changed too much, but the numerical database has grown a lot. The funnel isn’t 1-dimensional, so the relationships have changed on some axes, but not so much on others.

Notionally, I’d propose that the situation is something like this:

The mental database is still king for variety of concepts and immediacy or salience of information (especially to the owner of the brain involved). And, it still has some weaknesses, like the inability to easily observe, agree on and quantify the constructs included in it. In the last few decades, the numerical database has extended its reach tremendously.

The proper shape of the plot is probably very domain specific. When I drew this, I had in mind the typical corporate or policy setting, where information systems contain only a fraction of the information necessary to understand the organizations involved. But in some areas, the reverse may be true. For example, in earth systems, datasets are vast and include measurements that human senses can’t even make, whereas personal experience – and therefore mental models – is limited and treacherous.

I think I’ve understated the importance of the written database in the diagram above – perhaps I’m missing a dimension characterizing its cumulative nature (compared to the transience of mental databases). There’s also an interesting evolution underway, as tools for text analysis and large language models (ChatGPT) are making the written database more numerical in nature.

Finally, I think there’s a missing database in the traditional framework, which has growing importance. That’s the database of models themselves. They’ve been around for a long time – especially in physical sciences, but also corporate spreadsheets and the like. But increasingly, reasonably sophisticated models of organizational components are available as inputs to higher-level strategic problem solving modeling efforts.

Scientific Revolutions in Ventity

I’ve long wanted to translate the Sterman-Wittenberg model of Kuhnian paradigm revolutions to Ventity. The original was in Dynamo, and I translated that to Vensim, but neither is really satisfactory, because both require provisioning array space for new paradigms statically, before it’s needed. This means simulating lots of useless 0s, and even worse, looking at them in the output.

The model is about the lifecycle of scientific paradigms, so a central feature is the occasional introduction and evolution of new paradigms, which eventually accumulate enough anomalies to erode confidence, making them vulnerable to the next great idea. So ideally, you’d like to introduce new paradigms dynamically and delete them when they no longer have many adherents. Dynamic creation and deletion of entities is of course a core feature of Ventity – it’s the tool this model has been waiting for all those years.

I finally got around to translating my Vensim version to Ventity recently. It works beautifully:

Above, paradigm confidence, showing eight dominant paradigms as well as many smaller paradigms that never rise to dominance. They disappear when they run out of adherents. Below, puzzles under attack for the same paradigms.

Links to the source papers and more notes on the model are in the Vensim library entry. I think the dynamics are generalizable to other aspects of thinking in paradigms, like filter bubbles. The model is also a bit ‘meta’: Ventity as a distinct modeling paradigm that’s neither in the classical array-based world nor the code-based discrete agent world has struggled to win mindshare.

A minor note on use: the Run Config includes two setups: “replicate” and “random”. The “replicate” setup, which is inactive by default, launches paradigms at fixed times given by initialization data from a run of the Vensim version. This makes it possible to compare the simulations without divergence from randomness. However, the randomized run will normally be the more interesting way to work with this model.

The model (requires Ventity, which has a free trial license):


Computer Collates Climate Contrarian Claims

Coan et al. in Nature have an interesting text analysis of climate skeptics’ claims.

I’ve been at this long enough to notice that a few perennial favorites are missing, perhaps because they date from the 90s, prior to the dataset.

The big one is “temperature isn’t rising” or “the temperature record is wrong.” This has lots of moving parts. Back in the 90s, a key idea was that satellite MSU records showed falling temperatures, implying that the surface station record was contaminated by Urban Heat Island (UHI) effects. That didn’t end well, when it turned out that the UAH code had errors and the trend reversed when they were fixed.

Later UHI made a comeback when the SurfaceStations project crowdsourced an assessment of temperature station quality. Some turned out to be pretty bad. But again, when the dust settled, it turned out that the temperature trend was bigger, not smaller, when poor sites were excluded and TOD was corrected. This shouldn’t have been a surprise, because windy day analsyses and a dozen other things already ruled out UHI, but …

I consider this a reminder of the fact that part of the credibility of mainstream climate science arises not from the fact that models are so good, but because so many alternatives have been tried, and proved so bad, only to rise again and again.

Spreadsheets Strike Again

In this BBC podcast, stand-up mathematician Matt Parker explains the latest big spreadsheet screwup: overstating European productivity growth.

There are a bunch of killers in spreadsheets, but in this case the culprit was lack of a time axis concept, making it easy to misalign times for the GDP and labor variables. The interesting thing is that a spreadsheet’s strong suite – visibility of the numbers – didn’t help. Someone should have seen 22% productivity growth and thought, “that’s bonkers” – but perhaps expectations of a COVID19 rebound short-circuited the mental reality check.

ChatGPT struggles with pandemics

I decided to try out a trickier problem on ChatGPT: epidemiology.

This is tougher, because it requires some domain knowledge about terminology as well as some math. R0 itself is a slippery concept. It appears that ChatGPT is essentially equating R0 and the transmission rate; perhaps the result would be different had I used a different concept like force of infection.

Notice how ChatGPT is partly responding to my prodding, but stubbornly refuses to give up on the idea that the transmission rate needs to be less than R0, even though the two are not comparable.

Well, we got there in the end.

ChatGPT and the Department Store Problem

Continuing with the theme, I tried the department store problem out on ChatGPT. This is a common test of stock-flow reasoning, in which participants assess the peak stock of people in a store from data on the inflow and outflow.

I posed a simplified version of the problem:

Interestingly, I had intended to have 6 people enter at 8am, but I made a typo. ChatGPT did a remarkable job of organizing my data into exactly the form I’d doodled in my notebook, but then happily integrated to wind up with -2 people in the store at the end.

This is pretty cool, but it’s interesting that ChatGPT was happy to correct the number of people in the room, without making the corresponding correction to people leaving. That makes the table inconsistent.

We got there in the end, but I think ChatGPT’s enthusiasm for reality checks may be a little weak. Overall though I’d still say this is a pretty good demonstration of stock-flow reasoning. I’d be curious how humans would perform on the same problem.