A tale of Big Data and System Dynamics

I recently worked on a fascinating project that combined Big Data and System Dynamics (SD) to good effect. Neither method could have stood on its own, but the outcome really emphasized some of the strategic limitations of the data-driven approach. Including SD in the project simultaneously lowered the total cost of analysis, by avoiding data processing for things that could be determined a priori, and increased its value by connecting the data to business context.

I can’t give a direct account of what we did, because it’s proprietary, but here’s my best shot at the generalizable insights. The context was health care for some conditions that particularly affect low income and indigent populations. The patients are hard to track and hard to influence.

Two efforts worked in parallel: Big Data (led by another vendor) and System Dynamics (led by Ventana). I use the term “SD” loosely, because much of what we ultimately did was data-centric: agent-based modeling and estimation of individual-level nonlinear dynamic models in Vensim. The Big Data vendor’s budget was two orders of magnitude greater than ours, mostly due to some expensive system integration tasks, but partly, I suspect, due to the cachet of their brand and flashy approach.

Predict Patient Events

Big Data idea #1 was to use machine learning methods to predict imminent expensive events, so that patients could be treated preemptively, saving the cost of ER visits and other expensive procedures. This involved statistical analysis of extremely detailed individual patient histories. In parallel, we modeled the delivery system and aggregate patient trajectories through many different system states. (Google’s Verily is pursuing something similar.)

Ultimately, the Big Data approach didn’t pan out. I think the cause was largely data limitations. Patient records have severe quality issues, including (we hypothesized) unobserved feedback from providers gaming the system to work around health coverage limitations. More importantly, it’s still  problematic to correlate well-observed aspects of patient care with other important states of the patient, like employment and social network support.

Some of the problems were not evident until we began looking at things from a stock-flow perspective. For example, it turned out that admission and release records were not conserving people. Test statistics on the data might have revealed this, but no one even thought to look until we created an operational description of the system and started trying to balance units and apply conservation laws. Machine learning algorithms were perfectly happy to consume the data uncritically.
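For illustration, here’s the kind of conservation check that surfaced the problem, as a minimal pandas sketch. The file and column names (patient_events.csv, patient_id, event, date) are hypothetical, not the actual schema we worked with:

```python
# Hypothetical sketch of a conservation-of-people check on admission/release records.
# File names and columns are illustrative assumptions, not the real data schema.
import pandas as pd

events = pd.read_csv("patient_events.csv", parse_dates=["date"])  # hypothetical file

# Count admissions and releases per day
daily = (
    events.pivot_table(index="date", columns="event", values="patient_id",
                       aggfunc="count", fill_value=0)
    .reindex(columns=["admission", "release"], fill_value=0)
)

# Census implied by the flows: cumulative admissions minus cumulative releases
implied_census = (daily["admission"] - daily["release"]).cumsum()

# Compare with an independently reported census, if one exists
census = pd.read_csv("daily_census.csv", parse_dates=["date"]).set_index("date")["census"]
discrepancy = census - implied_census.reindex(census.index, method="ffill")

# If admissions and releases conserved people, the discrepancy would be roughly constant.
print(discrepancy.describe())
```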

Describing the system operationally in an SD model revealed a number of constraints that would have made implementation of the predictive approach difficult, even if data and algorithm constraints were eliminated. We also found that some of the insights from the Big Data approach were available from first principles in simple, aggregate “thinkpiece” models at a tiny fraction of the cost.

Things weren’t entirely rosy for the simulations, however. Building structural models is pretty quick, but calibrating them and testing alternative formulations is a slow process. Switching function calls to try a different machine learning algorithm is much easier. We really need hybrids that combine the best of both approaches.

An amusing aside: after data problems began to turn up, we found that we needed to understand more about patient histories. We made a movie that depicted a couple of years of events for tens of thousands of patients. It ran for almost an hour, with nothing but strips of colored dots. It has to be one of the most boring productions in the history of cinema, and yet it was absolutely riveting for anyone who understood the problem.

Segment Patient Populations & Buy Provider Risk

Big Data idea #2 was to abandon individual prediction and focus on population methods. The hope was to earn premium prices by improving on prevention in general, using that advantage to essentially sell reinsurance to providers through a patient performance contract.

The big data aspect involved a pharma marketing science favorite: segmentation. Cluster analysis and practical rules were used to identify “interesting” subpopulations of patients.

In parallel, we began to build agent and aggregate models of patient state trajectories, to see how the contract would actually work. Alarmingly, we discovered that the business rules of the proposal were not written down or fully specified anywhere. We made them up, but in the process discovered that there were many idiosyncratic cases that would need to be handled, like what to do with escrowed funds from patients whose performance history was incomplete at the time they churned out of a provider.

When we started to plug real numbers into the aggregate model, we immediately ran into a puzzle. We sampled costs from the (very large) full dataset, but we couldn’t replicate the relative costs of patients on and off the prevention protocol. After much digging, it turned out that there were two problems.

First, the distribution of patient costs is heavy-tailed (a power law). So, if clustering rules shift even a few high cost patients (out of millions) from one business category to another, the average cost can change significantly. Median costs are more stable, but the business doesn’t pay the median (this is essentially dictated by conservation of money).
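A minimal sketch of the mean-versus-median sensitivity, with made-up Pareto-distributed costs and an arbitrary segmentation (none of the numbers are from the project):

```python
# Illustrative sketch: why moving a few high-cost patients between segments
# shifts the mean a lot while the median barely moves. All parameters are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
costs = 1000 * (rng.pareto(a=1.5, size=n) + 1)   # heavy-tailed annual cost per patient

in_protocol = rng.random(n) < 0.5                # arbitrary 50/50 segmentation

# Reassign just the 20 highest-cost patients to the other segment
reassigned = in_protocol.copy()
reassigned[np.argsort(costs)[-20:]] = False

print("mean:  ", costs[in_protocol].mean(), "->", costs[reassigned].mean())
print("median:", np.median(costs[in_protocol]), "->", np.median(costs[reassigned]))
```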

Second, it turns out that clustering patients with short records is problematic, because it introduces selection and truncation biases into the determination of who’s at risk. The only valid fallback is to use each patient as their own control, with an errors-in-variables model. Once we refined the estimation techniques, we found that the short term savings from prevention didn’t pencil out very well when combined with the reinsurance business rules. (Fortunately, that’s not a general insight – we have another project underway that focuses on bigger advantages from prevention in the long run.)

Both of these issues should have been revealed by statistical red flags, poor confidence bounds, or failure to cross-validate. But for whatever reason, that didn’t happen until we put the estimates into practice. Fortunately, that happened in our simulation model, not an expensive real trial.

Insights

Looking back at this adventure, I think we learned some important things:

  • There are no hard lines – we did SD with tens of thousands of agents, aggregate models, discrete event simulations, large statistical estimates by brute force simulation, and the Big Data approach did a lot of simple aggregation.
  • Detailed and aggregate approaches are complementary.
    • Big Data measurements contribute to SD or detailed simulations;
    • SD modeling vets Big Data results in multiple ways.
  • Statistical approaches need business context, in-place validation, and synthetic data from simulations.
  • Anything you can do a priori, without data collection and processing, is an order of magnitude cheaper.
  • Simulation lets you rule out a lot of things that don’t matter and design experiments to focus on data that does matter, before you spend a ton of money.
  • SD raises questions you didn’t know to ask.
  • Hybrid approaches that embed machine learning algorithms in the decisions in SD models, or automate cross-validated calibration with structural experiments on SD models, would be very powerful.

Interestingly, the most enduring and interesting models, out of a dozen or so built over the course of the project, were some low-order SD models of patient life histories and adherence behavior. I hope to cover those in future posts.

The Ambiguity of Causal Loop Diagrams and Archetypes

I find causal loop diagramming to be a very useful brainstorming and presentation tool, but it falls short of what a model can do for you.

Here’s why. Consider the following pair of archetypes (Eroding Goals and Escalation, from wikipedia):

Eroding Goals and Escalation archetypes

Archetypes are generic causal loop diagram (CLD) templates, with a particular behavior story. The Escalation and Eroding Goals archetypes have identical feedback loop structures, but very different stories. So, there’s no unique mapping from feedback loops to behavior. In order to predict what a set of loops is going to do, you need more information.

Here’s an implementation of Eroding Goals:

Notice several things:

  • I had to specify where the stocks and flows are.
  • “Actions to Improve Goals” and “Pressure to Adjust Conditions” aren’t well defined (I made them proportional to “Gap”).
  • Gap is not a very good variable name.
  • The real world may have structure that’s not mentioned in the archetype (indicated in red).

Here’s Escalation:

The loop structure is mathematically identical; only the parameterization is different. Again, the missing information turns out to be crucial. For example, if A and B start with the same results, there is no escalation – A and B results remain constant. To get escalation, you either need (1) A and B to start in different states, or (2) some kind of drift or self-excitation in decision making (green arrow above).

Even then, you may get different results. (2) gives exponential growth, which is the standard story for escalation. (1) gives escalation that saturates:

The Escalation archetype would be better if it distinguished explicit goals for A and B results. Then you could mathematically express the key feature of (2) that gives rise to arms races:

  • A’s goal is x% more bombs than B
  • B’s goal is y% more bombs than A
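Here’s a minimal sketch of that relative-goal formulation. This is not the Vensim model from the figures; the first-order adjustment rule and the parameter values are my own illustrative assumptions:

```python
# Sketch of escalation with relative goals: A adjusts toward (1+x)*B, B toward (1+y)*A.
# The adjustment rule and all parameters are illustrative assumptions.
import numpy as np

def escalate(x=0.10, y=0.10, a0=1.0, b0=1.0, tau=1.0, dt=0.125, T=20):
    """First-order adjustment of A and B toward relative goals, Euler integration."""
    a, b = a0, b0
    for _ in range(int(T / dt)):
        da = ((1 + x) * b - a) / tau
        db = ((1 + y) * a - b) / tau
        a, b = a + da * dt, b + db * dt
    return a, b

print(escalate(x=0.0, y=0.0, a0=1.0, b0=1.0))   # identical goals and states: nothing happens
print(escalate(x=0.0, y=0.0, a0=1.0, b0=2.0))   # different initial states: adjustment saturates
print(escalate(x=0.1, y=0.1))                   # x%, y% relative goals: exponential growth
```

With x = y = 0 and equal initial states nothing happens; with unequal initial states the adjustment saturates; with positive x and y the outer-loop gain exceeds the damping and results grow exponentially.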

Both of these models are instances of a generic second-order linear model that encompasses all possible things a linear model can do:

Notice that the first-order and second-order loops are disentangled here, which makes it easy to see the “inner” first order loops (which often contribute damping) and the “outer” second order loop, which can give rise to oscillation (as above) or the growth in the escalation archetype. That loop is difficult to discern when it’s presented as a figure-8.
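In equations (my notation; the coefficients aren’t labeled in the original figure), the generic second-order linear model and its eigenvalues are:

```latex
\begin{aligned}
\dot{x} &= a_{11}\,x + a_{12}\,y \\
\dot{y} &= a_{21}\,x + a_{22}\,y, \qquad
\lambda = \frac{a_{11}+a_{22}}{2} \pm \sqrt{\left(\frac{a_{11}-a_{22}}{2}\right)^{2} + a_{12}\,a_{21}}
\end{aligned}
```

The diagonal gains a11 and a22 are the inner first-order loops (damping when negative); the product a12·a21 is the outer loop gain. When a12·a21 is negative enough to make the eigenvalues complex, you get oscillation; when it exceeds a11·a22, one eigenvalue turns positive and you get the growth of the escalation archetype.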

Of course, one could map these archetypes to other figure-8 structures, like:

How could you tell the difference? You probably can’t, unless you consider what the stocks and flows are in an operational implementation of the archetype.

The bottom line is that the causal loop diagram of an archetype or anything else doesn’t tell you enough to simulate the behavior of the system. You have to specify additional assumptions. If the system is nonlinear or stochastic, there might be more assumptions than I’ve shown above, and they might be important in new ways. The process of surfacing and testing those assumptions by building a stock-flow model is very revealing.

If you don’t build a model, you’re in the awkward position of intuiting behavior from structure that doesn’t uniquely specify any particular mode. In doing so, you might be way ahead of non-systems thinkers approaching the same problem with a laundry list. But your ability to discover errors, incorporate data and discover leverage is far greater if you can simulate.

The model: wikiArchetypes1b.mdl (runs in any version of Vensim)

Loopy

I just gave Loopy a try, after seeing Gene Bellinger’s post about it.

It’s cool for diagramming, and fun. There are some clever features, like drawing a circle to create a node (though I was too dumb to figure that out right away). Its shareability and remixing are certainly useful.

However, I think one must be very cautious about simulating causal loop diagrams directly. A causal loop diagram is fundamentally underspecified, which is why no method of automated conversion of CLDs to models has been successful.

In this tool, behavior is animated by initially perturbing the system (e.g., increase the number of rabbits in a predator-prey system). Then you can follow the story around a loop via animated arrow polarity changes – more rabbits causes more foxes, more foxes causes fewer rabbits. This is essentially the storytelling method of determining loop polarity, which I’ve used many times to good effect.

However, as soon as the system has multiple loops, you’re in trouble. Link polarity tells you the direction of change, but not the gain or nonlinearity. So, when multiple loops interact, there’s no way to determine which is dominant. Also, in a real system it matters which nodes are stocks; it’s not sufficient to assume that there must be at least one integration somewhere around a loop.

You can test this for yourself by starting with the predator-prey example on the home page. The initial model is a discrete oscillator (more rabbits -> more foxes -> fewer rabbits). But the real system is nonlinear, with oscillation and other possible behaviors, depending on parameters. In Loopy, if you start adding explicit births and deaths, which should get you closer to the real system, simulations quickly result in a sea of arrows in conflicting directions, with no way to know which tendency wins. So, the loop polarity simulation could be somewhere between incomprehensible and dead wrong.
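To see why, here’s a minimal continuous-time predator-prey sketch (classic Lotka-Volterra). The parameter values are arbitrary; this is my approximation of the “real system” behind the example, not anything Loopy produces:

```python
# Classic Lotka-Volterra predator-prey, showing that behavior depends on parameter
# values and nonlinear interaction terms, not just link polarities. Parameters are arbitrary.
import numpy as np

def simulate(birth=1.0, predation=0.1, efficiency=0.02, death=0.5,
             rabbits=40.0, foxes=9.0, dt=0.001, T=50.0):
    history = []
    for _ in range(int(T / dt)):
        d_r = birth * rabbits - predation * rabbits * foxes      # rabbit births minus predation
        d_f = efficiency * rabbits * foxes - death * foxes       # fox births minus fox deaths
        rabbits += d_r * dt
        foxes += d_f * dt
        history.append((rabbits, foxes))
    return np.array(history)

traj = simulate()          # sustained oscillation around an interior equilibrium
print(traj[::5000])        # sample every 5 time units (dt = 0.001)

# Raising fox mortality or lowering predation changes the amplitude, period, and
# equilibrium -- none of which is visible from loop polarity alone.
```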

Similarly, if you consider an SIR infection model, there are three loops of interest: spread of infection by contact, saturation from running out of susceptibles, and recovery of infected people. Depending on the loop gains, it can exhibit different behaviors. If recovery is stronger than spread, the infection dies out. If spread is initially stronger than recovery, the infection shifts from exponential growth to goal seeking behavior as dominance shifts nonlinearly from the spread loop to the saturation loop.
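A minimal SIR sketch (with illustrative parameters) makes the dominance shift concrete:

```python
# Minimal SIR model, illustrating loop-dominance shift: exponential growth while the
# contact loop dominates, goal-seeking decline as susceptibles deplete, or immediate
# die-out if recovery dominates. Parameters are illustrative.
def sir(beta, gamma, S=0.999, I=0.001, R=0.0, dt=0.1, T=100):
    peak = I
    for _ in range(int(T / dt)):
        infection = beta * S * I    # spread by contact (reinforcing, then saturating)
        recovery = gamma * I        # recovery (balancing)
        S -= infection * dt
        I += (infection - recovery) * dt
        R += recovery * dt
        peak = max(peak, I)
    return peak, R                  # peak prevalence, final attack fraction

print(sir(beta=0.5, gamma=0.25))    # spread > recovery: outbreak, then saturation
print(sir(beta=0.2, gamma=0.25))    # recovery > spread: infection dies out
```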

I think it would be better if the tool restricted itself to telling the story of one loop at a time, without making the leap to system simulations that are bound to be incorrect in many multiloop cases. With that simplification, I’d consider this a useful item in the toolkit. As is, I think it could be used judiciously for explanations, but for conceptualization it seems likely to prove dangerous.

My mind goes back to Barry Richmond’s approach to systems here. Causal loop diagrams promote thinking about feedback, but they aren’t very good at providing an operational description of how things work. When you’re trying to figure out something that you don’t understand a priori, you need the bottom-up approach to synthesize the parts you understand into the whole you’re grasping for, so you can test whether your understanding of processes explains observed behavior. That requires stocks and flows, explicit goals and actual states, and all the other things system dynamics is about. If we could get to that as elegantly as Loopy gets to CLDs, that would be something.

Aging is unnatural

Larry Yeager and I submitted a paper to the SD conference, proposing dynamic cohorts as a new way to model aging populations, vehicle fleets, and other quantities. Cohorts aren’t new*, of course, but Ventity makes it practical to allocate them on demand, so you don’t waste computation and attention on a lot of inactive zeroes.

The traditional alternative has been aging chains. Setting aside technical issues like dispersion, I think there’s a basic conceptual problem with aging chains: they aren’t a natural, intuitive operational representation of what’s happening in a system. Here’s why.

Consider a model of an individual. You’d probably model age like this:

Here, age is a state of the individual that increases with aging. Simple. Equivalently, you could calculate it from the individual’s birth date:

Ideally, a model of a population would preserve the simplicity of the model of the individual. But that’s not what the aging chain does:

This says that, as individuals age, they flow from one stock to another. But there’s no equivalent physical process for that. People don’t flow anywhere on their birthday. Age is continuous, but the separate stocks here represent an arbitrary discretization of age.

Even worse, if there’s mortality, the transition time from age x to age x+1 (the taus on the diagram above) is not precisely one year.
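One way to see this is a quick Monte Carlo on a single first-order aging-chain stage with a mortality hazard (parameters are illustrative): those who survive to age out do so in mean time tau/(1 + m·tau), not tau.

```python
# Sketch: transition-time distortion in a first-order aging-chain stage with mortality.
# Nominal residence time tau and mortality hazard m are illustrative values.
import numpy as np

rng = np.random.default_rng(0)
tau, m, n = 1.0, 0.2, 1_000_000          # 1-year stage, 20%/yr mortality, n individuals

age_out = rng.exponential(tau, n)        # first-order maturation times
death = rng.exponential(1.0 / m, n)      # mortality times
survivors = age_out < death              # those who age out before dying

print("mean transition time of survivors:", age_out[survivors].mean())
print("expected tau/(1 + m*tau):         ", tau / (1 + m * tau))
```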

You can contrast this with most categorical attributes of an individual or population:

When cars change (geographic) state, the flow represents an actual, physical movement across a boundary, which seems a lot more intuitive.

As we’ll show in the forthcoming paper, dynamic cohorts provide a more natural link between models of individuals and groups, and make it easy to see the lifecycle of a set of related entities. Here are the population sizes of annual cohorts for Japan:

I’ll link the paper here when it’s available.


* This was one of the applications we proposed in the original Ventity white paper, and others have arrived at the same idea, minus the dynamic allocation of the cohorts. Demographers have been doing it this way for ages, though usually in statistical approaches with no visual representation of the system.

Data science meets the bottom line

A view from simulation & System Dynamics


I come to data science from simulation and System Dynamics, which originated in control engineering, rather than from the statistics and database world. For much of my career, I’ve been working on problems in strategy and public policy, where we have some access to mental models and other a priori information, but little formal data. The attribution of success is tough, due to the ambiguity, long time horizons and diverse stakeholders.

I’ve always looked over the fence into the big data pasture with a bit of envy, because it seemed that most projects were more tactical, and establishing value based on immediate operational improvements would be fairly simple. So, I was surprised to see data scientists’ angst over establishing business value for their work:

One part of solving the business value problem comes naturally when you approach things from the engineering point of view. It’s second nature to include an objective function in our models, whether it’s the cash flow NPV for a firm, a project’s duration, or delta-V for a rocket. When you start with an abstract statistical model, you have to be a little more deliberate about representing the goal after the model is estimated (a simulation model may be the delivery vehicle that’s needed).

You can solve a problem whether you start with the model or start with the data, but I think your preferred approach does shape your world view. Here’s my vision of the simulation-centric universe:

The more your aspirations cross organizational silos, the more you need the engineering mindset, because you’ll have data gaps at the boundaries – variations in source, frequency, aggregation and interpretation. You can backfill those gaps with structural knowledge, so that the model-data combination yields good indirect measurements of system state. A machine learning algorithm doesn’t know about dimensional consistency, conservation of people, or accounting identities unless the data reveals such structure, but you probably do. On the other hand, when your problem is local, data is plentiful and your prior knowledge is weak, an algorithm can explore more possibilities than you can dream up in a given amount of time. (Combining the two approaches, by using prior knowledge of structure as “free data” constraints for automated model construction, is an area of active research here at Ventana.)

I think all approaches have a lot in common. We’re all trying to improve performance with systems science, we all have to deal with messy data that’s expensive to process, and we all face challenges formulating problems and staying connected to decision makers. Simulations need better connectivity to data and users, and purely data driven approaches aren’t going to solve our biggest problems without some strategic context, so maybe the big data and simulation worlds should be working with each other more.

Cross-posted from LinkedIn

Remembering Jay Forrester

I’m sad to report that Jay Forrester, pioneer in servo control, digital computing, System Dynamics, global modeling, and education, has passed away at the age of 98.


I’ve only begun to think about the ways Jay influenced my life, but digging through the archives here I ran across a nice short video clip on Jay’s hope for the future. Jay sounds as prescient as ever, given recent events:

“The coming century, I think, will be dominated by major social, political turmoil. And it will result primarily because people are doing what they think they should do, but do not realize that what they’re doing are causing these problems. So, I think the hope for this coming century is to develop a sufficiently large percentage of the population that have true insight into the nature of the complex systems within which they live.”

I delve into the roots of this thought in Election Reflection (2010).

Here’s a sampling of other Forrester ideas from these pages:

The Law of Attraction

Forrester on the Financial Crisis

Self-generated seasonal cycles

Deeper Lessons

Servo-chicken

Models

Market Growth

Urban Dynamics

Industrial Dynamics

World Dynamics

Dead buffalo diagrams

I think it was George Richardson who coined the term “dead buffalo” to refer to a diagram that surrounds a central concept with a hail of inbound causal arrows explaining it. This arrangement can be pretty useful as a list of things to think about, but it’s not much help toward solving a systemic problem from an endogenous point of view.

I recently found the granddaddy of them all:

[Figure: dead buffalo diagram]

Early economic dynamics: Samuelson's multiplier-accelerator

Paul Samuelson’s 1939 analysis of the multiplier-accelerator is a neat piece of work. Too bad it’s wrong.

Interestingly, this work dates from a time in which the very idea of a mathematical model was still questioned:

Contrary to the impression commonly held, mathematical methods properly employed, far from making economic theory more abstract, actually serve as a powerful liberating device enabling the entertainment and analysis of ever more realistic and complicated hypotheses.

Samuelson should be hailed as one of the early explorers of a very big jungle.

The basic statement of the model is very simple:

[Figure: national income equations]

In quasi-System Dynamics notation, that looks like:

[Figure: the multiplier-accelerator in quasi-SD notation]

A caveat:

The limitations inherent in so simplified a picture as that presented here should not be overlooked. In particular, it assumes that the marginal propensity to consume and the relation are constants; actually these will change with the level of income, so that this representation is strictly a marginal analysis to be applied to the study of small oscillations. Nevertheless it is more general than the usual analysis.

Samuelson hand-simulated the model (it’s fun – once – but he runs four scenarios). He then solves the discrete time system to identify four regions with different behavior: goal seeking (exponential decay to a steady state), damped oscillations, unstable (explosive) oscillations, and unstable exponential growth or decline. He nicely maps the parameter space:

[Figure: behavior regions in parameter space]
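For readers who want to reproduce the behavior regions, here’s a minimal sketch using what I believe is the standard 1939 reduced form, Y[t] = G + α(1+β)·Y[t−1] − αβ·Y[t−2], with government spending G normalized to 1. The parameter pairs below are illustrative choices from each region, not Samuelson’s scenarios:

```python
# Sketch of the multiplier-accelerator reduced form (my reconstruction):
# Y[t] = 1 + a*(1+b)*Y[t-1] - a*b*Y[t-2], where a is the marginal propensity to
# consume and b is the accelerator ("relation"). Parameter pairs are illustrative.
def national_income(a, b, periods=30):
    Y = [0.0, 1.0]                      # start from rest, then a unit of spending
    for _ in range(periods):
        Y.append(1 + a * (1 + b) * Y[-1] - a * b * Y[-2])
    return Y

for a, b in [(0.5, 0.0), (0.5, 1.0), (0.6, 2.0), (0.8, 4.0)]:
    print(a, b, [round(y, 2) for y in national_income(a, b)[:12]])
# (0.5, 0) approaches a steady state; (0.5, 1) gives damped oscillation;
# (0.6, 2) gives explosive oscillation; (0.8, 4) grows without oscillating.
```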

So where’s the problem?

The first is not so much of Samuelson’s making as it is a limitation of the pre-computer era. The essential simplification of the model for analytic solution is:

[Figure: the simplified reduced-form equation]

This is fine, but it’s incredibly abstract. Presented with this equation out of context – as readers often are – it’s almost impossible to posit a sensible description of how the economy works that would enable one to critique the model. This kind of notation remains common in econometrics, to the detriment of understanding and progress.

At the first SD conference, Gil Low presented a critique and reconstruction of the MA model that addressed this problem. He reconstructed the model, providing an operational description of the economy that remains consistent with the multiplier-accelerator framework.

The mere act of crafting a stock-flow description reveals problem #1: the basic multiplier-accelerator doesn’t conserve stuff.

Non-conservation of stuff leads to problem #2. When you do implement inventories and capital stocks, the period of multiplier-accelerator oscillations moves to about 2 decades – far from the 3-7 year period of the business cycle that Samuelson originally sought to explain. This occurs in part because the capital stock, with a 15-year lifetime, introduces considerable momentum. You simply can’t discover this problem in the original multiplier-accelerator framework, because too many physical and behavioral time constants are buried in the assumptions associated with its 2 parameters.

Low goes on to introduce labor, finding that variations in capacity utilization do produce oscillations of the required time scale.

I think there’s a third problem with the approach as well: discrete time. Discrete time notation is convenient for matching a model to data sampled at regular intervals. But the economy is not even remotely close to operating in discrete annual steps. Moreover, a one-year step is dangerously close to the 3-year period of the business cycle phenomenon of interest. This means that it is a distinct possibility that some of the oscillatory tendency is an artifact of discrete time sampling. While improper oscillations can be detected analytically, with discrete time notation it’s not easy to apply the simple heuristic of halving the time step to test stability, because it merely compresses the time axis or causes problems with implicit time constants, depending on how the model is implemented. Halving the time step and switching to RK4 integration illustrates these issues:

[Figure: simulation with a halved time step and RK4 integration]

It seems like a no-brainer, that economic dynamic models should start with operational descriptions, continuous time, and engineering state variable or stock flow notation. Abstraction and discrete time should emerge as simplifications, as needed for analysis or calibration. The fact that this has not become standard operating procedure suggests that the invisible hand is sometimes rather slow as it gropes for understanding.

The model is in my library.

See Richardson’s Feedback Thought in Social Science and Systems Theory for more history.

Parameter Distributions

Answering my own question, here’s the distribution of all 12,000 constants from a dozen models, harvested from my hard drive. About half are from Ventana, and there are a few classics, like World3. All are policy models – no physics, biology, etc.

The vertical scale is magnitude, ABS(x). Values are sorted on the horizontal axis, so that negative values appear on the left. Incredibly, there were only about 60 negative values in the set. Clearly, unlike linear models where signs fall where they may, there’s a strong user preference for parameters with a positive sense.

Next comes a big block of 0s, which don’t show on the log scale. Most of the 0s are not really interesting parameters; they’re things like switches in subscript mapping, though doubtless some are real.

At the right are the positive values, ranging from about 10^-15 to 10^+15. The extremes are units converters and physical parameters (area of the earth). There are a couple of flat spots in the distribution – 1s (green arrow), probably corresponding with the 0s, though some are surely “interesting”, and years (i.e. things with a value of about 2000, blue arrow).

If you look at just the positive parameters, here’s the density histogram, in log10 magnitude bins:

Again, the two big peaks are the 1s and the 2000s. The 0s would be off the scale by a factor of 2. There’s clearly some asymmetry – more numbers greater than 1 (magnitude 0) than less.

One thing that seems clear here is that log-uniform (which would be a flat line on the last two graphs) is a bad guess.
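For anyone who wants to try this on their own models, here’s roughly how the last plot could be put together. The harvesting of constants from .mdl files is omitted, and constants.txt is a hypothetical dump of one value per line:

```python
# Sketch of the magnitude analysis, assuming constants have already been harvested
# into a plain text file (one value per line) -- a hypothetical intermediate file.
import numpy as np
import matplotlib.pyplot as plt

values = np.loadtxt("constants.txt")          # hypothetical dump of model constants

neg, zero, pos = values[values < 0], values[values == 0], values[values > 0]
print(len(neg), "negative,", len(zero), "zero,", len(pos), "positive")

# Density of positive parameters in log10-magnitude bins
plt.hist(np.log10(pos), bins=np.arange(-16, 17), density=True)
plt.xlabel("log10 magnitude")
plt.ylabel("density")
plt.show()
```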

What's the empirical distribution of parameters?

Vensim’s answer to exploring ill-behaved problem spaces is either to do hill-climbing with random restarts, or MCMC and simulated annealing. Either way, you need to start with some initial distribution of points to search.

It’s helpful if that distribution is somehow efficient at exploring the interesting parts of the space. I think this is closely related to the problem of selecting uninformative priors in Bayesian statistics. There’s lots of research about appropriate uninformative priors for various kinds of parameters. For example,

  • If a parameter represents a probability, one might choose the Jeffreys or Haldane prior.
  • Indifference to units, scale and inversion might suggest the use of a log uniform prior, where nothing else is known about a positive parameter.

However, when a user specifies a parameter in Vensim, we don’t even know what it represents. So what’s the appropriate prior for a parameter that might be positive or negative, a probability, a time constant, a scale factor, an initial condition for a physical stock, etc.?

On the other hand, we aren’t quite as ignorant as the pure maximum entropy derivation usually assumes. For example,

  • All numbers have to lie between the largest and smallest float or double, i.e. +/- 3e38 or 2e308.
  • More practically, no one scales their models such that a parameter like 6.5e173 would ever be required. There’s a reason that metric prefixes range from yotta to yocto (10^24 to 10^-24). The only constant I can think of that approaches that range is Avogadro’s number (though there are probably others), and that’s not normally a changeable parameter.
  • For lots of things, one can impose more constraints, given a little more information (a rough sampler along these lines is sketched below):
    • A time constant or delay must lie on [TIME STEP, infinity], and the “infinity” of interest is practically limited by the simulation duration.
    • A fractional rate of change similarly must lie on [-1/TIME STEP, 1/TIME STEP] for stability.
    • Other parameters probably have limits for stability, though it may be hard to discover them except by experiment.
    • A parameter with units of year is probably modern, [1900-2100], unless you’re doing Mayan archaeology or paleoclimate.
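Here’s a rough sketch of what an initial-point generator built on those constraints might look like. The distributional choices are my own assumptions, not anything Vensim actually does:

```python
# Rough sketch of an initial-point generator for restarts/MCMC, using the
# constraints listed above. Distributional choices are illustrative assumptions.
import numpy as np

rng = np.random.default_rng()

def sample_time_constant(time_step, final_time, size):
    # log-uniform on [TIME STEP, simulation duration]
    return np.exp(rng.uniform(np.log(time_step), np.log(final_time), size))

def sample_fractional_rate(time_step, size):
    # uniform on [-1/TIME STEP, 1/TIME STEP] for stability
    return rng.uniform(-1.0 / time_step, 1.0 / time_step, size)

def sample_generic_positive(size, lo=1e-24, hi=1e24):
    # log-uniform between roughly yocto and yotta, absent better information
    return np.exp(rng.uniform(np.log(lo), np.log(hi), size))

def sample_year(size):
    # parameters with units of year are probably modern
    return rng.uniform(1900, 2100, size)

print(sample_time_constant(0.125, 100, 3))
print(sample_fractional_rate(0.125, 3))
print(sample_generic_positive(3))
print(sample_year(3))
```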

At some point, the assumptions become too heroic, and we need to rely on users for some help. But it would still be really interesting to see the distribution of all parameters in real models. (See next …)