There are lots of good reasons for building models without data. However, if you want to measure something (i.e. estimate model parameters), produce results that are closely calibrated to history, or drive your model with historical inputs, you need data. Most statistical modeling you’ll see involves static or dynamically simple models and well-behaved datasets: nice flat files with uniform time steps, matching (or, alarmingly, ignored) units, and no missing points. Things are generally much messier with a system dynamics model, which typically has broad scope and (one would hope) lots of dynamics. The diversity of data needed to accompany a model presents several challenges:
- disagreement among sources
- missing data points
- non-uniform time intervals
- variable quality of measurements
- diverse source formats (spreadsheets, text files, databases)
The mathematics for handling the technical estimation problems was developed by Fred Schweppe and others at MIT decades ago. David Peterson’s thesis lays out the details for SD-type models, and most of the functionality he describes is built into Vensim. It’s also possible, of course, to go a simpler route; even hand calibration is often effective and reasonably quick when coupled with Synthesim.
Either way, you have to get your data corralled first. For a simple model, I’ll build the data right into the dynamic model. But for complicated models, I usually don’t want the main model bogged down with units conversions and links to a zillion files. In that case, I first build a separate datamodel, which does all the integration and passes cleaned-up series to the main model as a fast binary file (an ordinary Vensim .vdf). In creating the data infrastructure, I try to maximize three things:
- Replicability. Minimize the number of manual steps in the process by making the datamodel do everything. Connect the datamodel directly to primary sources, in formats as close as possible to the original. Automate multiple steps with command scripts (see the sketch after this list). Never use hand calculations scribbled on a piece of paper, unless you’re scrupulous about lab notebooks, or you note the details in the equations’ documentation fields.
- Transparency. Often this means “don’t do complex calculations in spreadsheets.” Spreadsheets are very good at some things, like serving as a data container that gives good visibility. However, spreadsheet calculations are error-prone and hard to audit. So, I try to do everything, from units conversions to interpolation, in Vensim.
- Quality. #1 and #2 already go a long way toward ensuring quality. However, it’s possible to go further. First, actually look at the data. Take time to build a panel of on-screen graphs so that problems are instantly visible. Use a statistics or visualization package to explore it. Lately, I’ve been going a step further, by writing Reality Checks to automatically test for discontinuities and other undesirable properties of spliced time series (sketched after this list). This works well when the data is simply too voluminous to check manually.
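To keep the whole chain replicable, I drive it with a command script rather than a remembered sequence of clicks. A minimal sketch (file and run names are placeholders, and the exact commands vary a bit by Vensim version) that runs the datamodel and then feeds its output to the main model looks something like:

SPECIAL>LOADMODEL|datamodel.mdl
SIMULATE>RUNNAME|data
MENU>RUN|o
SPECIAL>LOADMODEL|main.mdl
SIMULATE>RUNNAME|base
SIMULATE>DATA|data.vdf
MENU>RUN|o
MENU>EXIT

A Reality Check on spliced series can be as simple as a constraint on period-to-period changes. Roughly, and with invented names (the threshold is something you have to pick for each series):

d log GDP per cap[country] := log GDP per cap[country] - TIME SHIFT( log GDP per cap[country] , 1 ) { year-on-year change, computed in the datamodel }
GDP splice is smooth[country] :THE CONDITION: d log GDP per cap[country] <> :NA: :IMPLIES: ABS( d log GDP per cap[country] ) < max credible annual change

If a splice introduces a jump bigger than the threshold, the constraint fails and the Reality Check run flags it, which beats eyeballing a few hundred country series.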
This can be quite a bit of work up front, but the payoff is large: less model rework later, easy updates, and higher quality. It’s also easier to generate graphics or statistics that help others gain confidence in the model, though it’s sometimes important to help them recognize that goodness of fit is a weak test of quality.
It’s good to build the data infrastructure before you start modeling, because that way your drivers and quality control checks are in place as you build structure, so you avoid the pitfalls of an end-of-pipe inspection process. A frequent finding in our corporate work has been that cherished data is in fact rubbish, or means something quite different from what users have historically assumed. Ventana colleague Bill Arthur argues that modern IT practices are making the situation worse, not better, because firms aren’t retaining data as long (perhaps a misplaced side effect of a mania for freshness).
Here are some samples of things you can do in a datamodel, including some from C-ROADS:
Combine metrics:
GDP per cap[country]:=gdp[country]/population[country]
Convert units:
CO2 emissions[country]:=kton CO2 emissions[country]*tonC per kTonC
Switch to log scale:
log GDP per cap[country]:=LOG(GDP per cap[country],10)
Recenter data points on their intervals:
GDP := TIME SHIFT( EOY GDP, half a year)
This is necessary because reported data often represents the endpoint of a period or an accumulation over it. For example, 1996 GDP means the accumulation over all of 1996, and is probably best placed at 1996.5 (depending on what you’re doing with it). 2005 inventory might mean inventory at the end of 2005. However, in a model, year X means Jan. 1st of year X, i.e. 1996.0 and 2005.0 in these examples, especially if the time step is less than a year.
Subtract a computed mean:
HADCRUT3 anomaly vs 1990=IF THEN ELSE(HADCRUT3 anomaly=:NA:,:NA:,HADCRUT3 anomaly-HADCRUT3 mean 1990)
HADCRUT3 mean 1990= INITIAL(GET DATA MEAN(HADCRUT3 anomaly,y1980,y1999))
HADCRUT3 anomaly :RAW::=GET XLS DATA('?Data Comparison','T HADCRUT3','a','n1')
Notice that this is tricky: the mean-centered variable HADCRUT3 anomaly vs 1990 can’t be a data equation, so it includes the test for HADCRUT3 anomaly=:NA: to ensure that it only generates values where the original series has values. That in turn requires use of the :RAW: keyword in the original series. Notice also that the GET DATA MEAN function is wrapped in an INITIAL statement, because it’s slow to execute and only needs to happen once.
There are lots of other tricks, but I’ll have to save those for another day.
Tom, very good comments. Most of what I want to say is “Me, too,” so I’ll refrain from that level of detail, but there are two points I’d like to make.
First, I think the notion of data transparency and spreadsheets extends past data modeling for SD models. Even though I’m not as skilled as I’d like to be with it, I’ve used the (free) array language J (http://www.jsoftware.com/) in lieu of spreadsheets whenever I can for the past few years, and it’s been quite a good ride. To stay on topic, I have used it to prepare data for SD models, and I have used it to analyze data from such models.
My second point is more of a query than a point to make. I, too, see value in applying good data analysis (call it statistics, if you will) to SD model results in at least certain cases. For a variety of reasons, I’ve been exploring a Bayesian analysis approach using MCMC and the MCSim simulator for the past few years. I’m not an expert in the approach Vensim uses, but I gather it’s more closely allied to Kalman filtering and that thread of inquiry. I’m curious if you’ve got any insights into how the two approaches might differ in terms of the insights they provide.
Thanks for the ideas you share here. BTW, I don’t want to take away from Vensim by mentioning MCSim; I suspect the use of multiple tools can help us become better at this stuff. There was research by Bill Curtis at the old MCC that suggested a correlation between the number of languages a programmer used and the programmer’s skill at software development. Perhaps the same thing applies to us.
Bill
Hi Bill –
I’m not familiar enough with MCSim to comment. A Kalman filter is essentially a Bayesian updating process for the model state, but I’m guessing that you mean something different. Now I’m curious …
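To be concrete about what I mean: in the standard linear-Gaussian form, with a prior state estimate \hat{x}^- and covariance P^-, and a measurement z = H x + v with noise covariance R, the measurement update is

K = P^- H^T (H P^- H^T + R)^{-1}
\hat{x} = \hat{x}^- + K (z - H \hat{x}^-)
P = (I - K H) P^-

which is just Bayes’ rule for a Gaussian prior and likelihood, repeated at each measurement time; the extended (nonlinear) version linearizes the model around the current estimate.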
I used to have a bookmark to a modeling software package that aimed at total replicability, but I can’t find it at the moment. However, there are lots of ways to skin the cat. R scripts are fairly complete, for example.
Tom
Hi, Tom.
In essence, MCSim does what OpenBUGS (http://mathstat.helsinki.fi/openbugs/) does, except that it integrates that with a differential equation solver using the Gear algorithm. http://en.wikipedia.org/wiki/MCSim and http://fredomatic.free.fr/ describe it, and the manual is at http://www.gnu.org/software/mcsim/mcsim.html. http://www.gnu.org/software/mcsim/mcsim.html#SEC49 begins to describe the setup of the Bayesian analysis, although it really helps to know a bit about MCMC sampling first and perhaps to find a few of Frederic Bois’ journal articles. (It also helps to find the quick reference card on my site, if you want to use it as a DYNAMO equivalent.)
And, yes, R is often one’s good friend. While I’ve tended to use J both to set up more complex MCSim experiments and to analyze the data, I could have done that in R or Gnuplot, too, and R has a nice interface to OpenBUGS, as you probably know.
Bill
Perhaps put another way, I sense Vensim’s approach uses Bayesian analysis as developed for control theory, while MCSim uses Bayesian analysis as developed for physics. I’m curious if they give essentially the same results or if there are times to choose one or the other (and what those criteria might be).