There are lots of good reasons for building models without data. However, if you want to measure something (i.e. estimate model parameters), produce results that are closely calibrated to history, or drive your model with historical inputs, you need data. Most statistical modeling you’ll see involves static or dynamically simple models and well-behaved datasets: nice flat files with uniform time steps, units matching (or, alarmingly, ignored), and no missing points. Things are generally much messier with a system dynamics model, which typically has broad scope and (one would hope) lots of dynamics. The diversity of data needed to accompany a model presents several challenges:
- disagreement among sources
- missing data points
- non-uniform time intervals
- variable quality of measurements
- diverse source formats (spreadsheets, text files, databases)
The mathematics for handling the technical estimation problems were developed by Fred Schweppe and others at MIT decades ago. David Peterson’s thesis lays out the details for SD-type models, and most of the functionality described is built into Vensim. It’s also possible, of course, to go a simpler route; even hand calibration is often effective and reasonably quick when coupled with Synthesim.
Either way, you have to get your data corralled first. For a simple model, I’ll build the data right into the dynamic model. But for complicated models, I usually don’t want the main model bogged down with units conversions and links to a zillion files. In that case, I first build a separate datamodel, which does all the integration and passes cleaned-up series to the main model as a fast binary file (an ordinary Vensim .vdf). In creating the data infrastructure, I try to maximize three things:
- Replicability. Minimize the number of manual steps in the process by making the data model do everything. Connect the datamodel directly to primary sources, in formats as close as possible to the original. Automate multiple steps with command scripts. Never use hand calculations scribbled on a piece of paper, unless you’re scrupulous about lab notebooks, or note the details in equations’ documentation field.
- Transparency. Often this means “don’t do complex calculations in spreadsheets.” Spreadsheets are very good at some things, like serving as a data container that gives good visibility. However, spreadsheet calculations are error-prone and hard to audit. So, I try to do everything, from units conversions to interpolation, in Vensim.
- Quality.#1 and #2 already go a long way toward ensuring quality. However, it’s possible to go further. First, actually look at the data. Take time to build a panel of on-screen graphs so that problems are instantly visible. Use a statistics or visualization package to explore it. Lately, I’ve been going a step farther, by writing Reality Checks to automatically test for discontinuities and other undesirable properties of spliced time series. This works well when the data is simply to voluminous to check manually.
This can be quite a bit of work up front, but the payoff is large: less model rework later, easy updates, and higher quality. It’s also easier generate graphics or statistics that help others to gain confidence in the model, though it’s sometimes important to help them recognize that goodness of fit is a weak test of quality.
It’s good to build the data infrastructure before you start modeling, because that way your drivers and quality control checks are in place as you build structure, so you avoid the pitfalls of an end-of-pipe inspection process. A frequent finding in our corporate work has been that cherished data is in fact rubbish, or means something quite different that what users have historically assumed. Ventana colleague Bill Arthur argues that modern IT practices are making the situation worse, not better, because firms aren’t retaining data as long (perhaps a misplaced side effect of a mania for freshness).
Here are some samples of things you can do in a datamodel, including some from C-ROADS:
GDP per cap[country]:=gdp[country]/population[country]
CO2 emissions[country]:=kton CO2 emissions[country]*tonC per kTonC
Switch to log scale:
log GDP per cap[country]:=LOG(GDP per cap[country],10)
Recenter data points on their intervals:
GDP := TIME SHIFT( EOY GDP, half a year)
This is necessary because reported data often represents the endpoint or accumulation over a period. For example, 1996 GDP means the accumulation over all of 1996, and is probably best placed at 1996.5 (depending on what you’re doing with it). 2005 inventory might mean inventory at the end of 2005. However, in a model, year X means Jan. 1st of X, or 1996.0 and 2005.0, especially if the time step is less than a year.
Subtract a computed mean:
HADCRUT3 anomaly vs 1990=IF THEN ELSE(HADCRUT3 anomaly=:NA:,:NA:,HADCRUT3 anomaly-HADCRUT3 mean 1990)
HADCRUT3 mean 1990= INITIAL(GET DATA MEAN(HADCRUT3 anomaly,y1980,y1999))
HADCRUT3 anomaly :RAW::=GET XLS DATA(‘?Data Comparison’,’T HADCRUT3′,’a’,’n1′)
Notice that this is tricky: the mean-centered variable HADCRUT3 anomaly vs 1990 can’t be a data equation, so it includes the test for HADCRUT3 anomaly=:NA: to ensure that it only generates values where the original series has values. That in turn requires use of the :RAW: keyword in the original series. Notice also that the GET DATA MEAN function is wrapped in an INITIAL statement, because it’s slow to execute and only needs to happen once.
There are lots of other tricks, but I’ll have to save those for another day.