Coronavirus Curve-fitting OverConfidence

This is a follow-on to The Normal distribution is a bad COVID19 model.

I understand that the IHME model is now more or less the official tool of the Federal Government. Normally I’m happy to see models guiding policy. It’s better than the alternative: would you fly in a plane designed by lawyers? (Apparently we have been.)

However, there’s nothing magic about a model. Using flawed methods, bad data, the wrong boundary, etc. can make the results GIGO. When a bad model blows up, the consequences can be just as harmful as any other bad reasoning. In addition, the metaphorical shrapnel hits the rest of us modelers. Currently, I’m hiding in my foxhole.

On top of the issues I mentioned previously, I think there are two more problems with the IHME model:

First, they fit the Normal distribution to cumulative cases, rather than incremental cases. Even in a parallel universe where the nonphysical curve fit was optimal, this would lead to understatement of the uncertainty in the projections.
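One way to see the first point: least-squares fitting treats the errors at each data point as independent, but errors in a cumulative series are running sums of the daily errors, so they are strongly correlated from one day to the next. Here's a minimal sketch with synthetic noise (the noise level and series length are made-up illustration values, not IHME data) that compares the lag-1 autocorrelation of the two kinds of error.

```python
import random

def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    num = sum(centered[k] * centered[k + 1] for k in range(n - 1))
    den = sum(c * c for c in centered)
    return num / den

random.seed(1)
# Hypothetical independent day-to-day reporting errors (arbitrary units).
daily_errors = [random.gauss(0.0, 1.0) for _ in range(200)]

# Errors in the cumulative series are the running sum of the daily errors.
cumulative_errors = []
total = 0.0
for e in daily_errors:
    total += e
    cumulative_errors.append(total)

print("daily:", round(lag1_autocorrelation(daily_errors), 2))
print("cumulative:", round(lag1_autocorrelation(cumulative_errors), 2))
```

The daily errors should show little autocorrelation, while the cumulative errors behave like a random walk. A fit that assumes independence therefore counts the cumulative points as far more informative than they really are, and the resulting bounds come out too narrow.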

Second, because the model has no operational mapping of real-world concepts to equation structure, you have no hooks to use to inject policy changes and the uncertainty associated with them. You have to construct some kind of arbitrary index and translate that to changes in the size and timing of the peak in an unprincipled way. This defeats the purpose of having a model.

For example, from the methods paper:

A covariate of days with expected exponential growth in the cumulative death rate was created using information on the number of days after the death rate exceeded 0.31 per million to the day when different social distancing measures were mandated by local and national government: school closures, non-essential business closures including bars and restaurants, stay-at-home recommendations, and travel restrictions including public transport closures. Days with 1 measure were counted as 0.67 equivalents, days with 2 measures as 0.334 equivalents and with 3 or 4 measures as 0.

This postulates a relationship that has only the most notional grounding. There’s no concept of compliance, nor any sense of the effect of stringency and exceptions.

In the real world, there’s also no linear relationship between “# policies implemented” and “days of exponential growth.” In fact, I would expect this to be extremely nonlinear, with a threshold effect. Either your policies reduce R0 below 1 and the epidemic peaks and shrinks, or they don’t, and it continues to grow at some positive rate until a large part of the population is infected. I don’t think this structure captures that reality at all.

That’s why, in the IHME figure above (retrieved yesterday), you don’t see any scenarios in which the epidemic fizzles, because we get lucky and warm weather slows the virus, or there are many more mild cases than we thought. You also don’t see any runaway scenarios in which measures fail to bring R0 below 1, resulting in sustained growth. Nor is there any possibility of ending measures too soon, resulting in an echo.
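The threshold effect described above is easy to reproduce with a bare-bones SIR model. This is only a sketch under stated assumptions: a 5-day infectious period, a seed of 1 infection per 10,000 people, and simple Euler integration are all illustrative choices, and it is not the hybrid-SEIR model discussed in this post.

```python
def attack_rate(r0, infectious_days=5.0, i0=1e-4, dt=0.1, days=2000):
    """Fraction of the population ever infected in a basic SIR model."""
    gamma = 1.0 / infectious_days   # recovery rate per day
    beta = r0 * gamma               # transmission rate per day
    s, i = 1.0 - i0, i0
    for _ in range(int(days / dt)):
        new_infections = beta * s * i * dt
        s -= new_infections
        i += new_infections - gamma * i * dt
    return 1.0 - s

# Sweep the reproduction number across the threshold at 1.
for r0 in (0.8, 0.9, 1.1, 1.5, 2.5):
    print(r0, round(attack_rate(r0), 4))
```

Below the threshold, the outbreak stays close to the size of the initial seed. Just above it, a substantial fraction of the population is eventually infected. Nothing about that response looks like a linear function of the number of policies in force.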

For comparison, I ran some sensitivity runs of my model for North Dakota last night. I included uncertainty from fit to data (for example, R0 constrained to fit observations via MCMC) and some a priori uncertainty about effectiveness and duration of measures, and from the literature about fatality rates, seasonality, and unobserved asymptomatics.

I found that I couldn’t exclude the IHME projections from my confidence bounds, so they’re not completely crazy. However, they understate the uncertainty in the situation by a huge margin. They forecast the peak at a fairly definite time, plus or minus a factor of two. With my hybrid-SEIR model, the 95% bounds include variation by a factor of 10. The difference is that their bounds are derived only from curve fitting, and therefore omit a vast amount of structural uncertainty that is represented in my model.

Who is right? We could argue, but since the IHME model is statistically flawed and doesn’t include any direct effect of uncertainty in R0, prevalence of unobserved mild cases, temperature sensitivity of the virus, effectiveness of measures, compliance, travel, etc., I would not put any money on the future remaining within their confidence bounds.

The Normal distribution is a bad COVID19 model

Forecasting diffusion processes by fitting sigmoid curves has a long history of failure. Let’s not repeat those mistakes in the COVID19 epidemic.

I’ve seen several models explaining “flattening the curve” that use the Normal distribution as a model of the coronavirus epidemic. Now this model uses it to forecast peak hospital load:

We developed a curve-fitting tool to fit a nonlinear mixed effects model to the available admin 1 cumulative death data. The cumulative death rate for each location is assumed to follow a parametrized Gaussian error function … where the function is the Gaussian error function (written explicitly above), p controls the maximum death rate at each location, t is the time since death rate exceeded 1e-15, β (beta) is a location-specific inflection point (time at which rate of increase of the death rate is maximum), and α (alpha) is a location-specific growth parameter. Other sigmoidal functional forms … were considered but did not fit the data as well. Data were fit to the log of the death rate in the available data, using an optimization framework described in the appendix.

One bell-shaped curve is as good as another, right? No!

Like Young Frankenstein, epidemic curves are not Normal.

1. Fit to data is a weak test.

The graph below compares 3 possible models: the Normal distribution, the Logistic distribution (which has an equivalent differential equation interpretation), and the SEIR model. Consider what’s happening when you fit a sigmoid to the epidemic data so far (red box). The curves below are normalized to yield similar peaks, but imagine what would happen to the peaks if you fit all 3 to the same data series.

The problem is that this curve-fitting exercise expects data from a small portion of the behavior to tell you about the peak. But over that interval, there’s little behavior variation. Any exponential is going to fit reasonably well. Even worse, if there are any biases in the data, such as dramatic shifts in test coverage, the fit is likely to reflect those biases as much as it does the physics of the system. That’s largely why the history of fitting diffusion models to emerging trends in the forecasting literature is so dreadful.

After the peak, the right tail of the SEIR model is also quite different, because the time constant of recovery is different from the time constant for the growth phase. This asymmetry may also have implications for planning.
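The asymmetry of the tails can be checked with a minimal SEIR sketch. The parameter values (R0 of 3, 5-day incubation, 5-day infectious period, a seed of one exposed person per million) are assumptions chosen for illustration, and the Euler integration is deliberately simple.

```python
import math

def simulate_seir(r0=3.0, incubation_days=5.0, infectious_days=5.0,
                  e0=1e-6, dt=0.1, days=300):
    """Euler-integrate a basic SEIR model; return infectious fraction at each whole day."""
    sigma = 1.0 / incubation_days   # E -> I rate
    gamma = 1.0 / infectious_days   # I -> R rate
    beta = r0 * gamma               # transmission rate
    s, e, i = 1.0 - e0, e0, 0.0
    steps_per_day = int(round(1.0 / dt))
    infectious = [i]
    for _day in range(days):
        for _step in range(steps_per_day):
            new_exposed = beta * s * i * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
        infectious.append(i)
    return infectious

def log_slope(series, day_a, day_b):
    """Average daily growth rate of log(series) between two days."""
    return (math.log(series[day_b]) - math.log(series[day_a])) / (day_b - day_a)

infectious = simulate_seir()
peak_day = max(range(len(infectious)), key=infectious.__getitem__)
print("peak day:", peak_day)
print("early growth rate per day:", round(log_slope(infectious, 20, 40), 3))
print("late decline rate per day:", round(log_slope(infectious, 250, 300), 3))
```

With these assumptions the late decline is slower than the early growth, so the curve is not symmetric about its peak the way a Normal curve is.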

2. The properties of the Normal distribution don’t match the observed behavior of coronavirus.

It’s easier to see what’s going on if you plot the curves above on a log-y scale:

The logistic and SEIR models have a linear left tail. That is to say that they have a constant growth rate in the early epidemic, until controls are imposed or you run out of susceptible people.

The Normal distribution (red) is a parabola, which means that the growth rate is steadily decreasing, long before you get near the peak. Similarly, if you go backwards in time, the Normal distribution predicts that the growth rate would have been higher back in November, when patient 0 emerged.
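The same comparison can be made numerically by differentiating the log of each curve. In this sketch the peak time, the Normal width `sigma`, and the logistic rate `k` are arbitrary illustration values, not fitted to anything.

```python
import math

def log_normal_curve(t, peak=100.0, sigma=20.0):
    """Log of a Normal-shaped epidemic curve, up to an additive constant."""
    return -((t - peak) ** 2) / (2.0 * sigma ** 2)

def log_logistic_curve(t, peak=100.0, k=0.1):
    """Log of the logistic density (slope of the logistic sigmoid), up to a constant."""
    x = k * (t - peak)
    return -x - 2.0 * math.log1p(math.exp(-x))

def growth_rate(log_curve, t, h=1e-3):
    """Fractional growth rate d(ln f)/dt by central difference."""
    return (log_curve(t + h) - log_curve(t - h)) / (2.0 * h)

# Days well before the peak at t = 100.
for t in (0, 20, 40, 60, 80):
    print(t,
          round(growth_rate(log_normal_curve, t), 3),
          round(growth_rate(log_logistic_curve, t), 3))
```

The Normal curve's growth rate falls linearly with time all the way from the start, and keeps rising the further back you go. The logistic rate stays close to `k` until the peak approaches.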

There is some reason to think that epidemics start faster due to social network topology, but also some reasons for slower emergence. In any case, that’s not what is observed for COVID19 – uncontrolled growth rates are pretty constant:

https://aatishb.com/covidtrends/

3. With weak data, you MUST have other quality checks

Mining data to extract relationships works great in many settings. But when you have sparse data with lots of known measurement problems, it’s treacherous. In that case, you need a model of the physics of the system and the lags and biases in the data generating process. Then you test that model against all available information, including

• conservation laws,
• operational correspondence with physical processes,
• opinions from subject matter experts and measurements from other levels of aggregation,
• dimensional consistency,
• robustness in extreme conditions, and finally
• fit to data.

Fortunately, a good starting point has existed for almost a century: the SEIR model. It’s not without pitfalls, and needs some disaggregation and a complementary model of policies and the case reporting process, but if you want simple projections, it’s a good place to start.

Once you have triangulation from all of these sources, you have some hope of getting the peak right. But your confidence bounds should still be derived not only from the fit itself, but also priors on parameters that were not part of the estimation process.
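As a rough sketch of what propagating priors looks like, the following Monte Carlo draws an uncontrolled R0 and a policy effectiveness from made-up uniform ranges (both ranges are hypothetical, purely for illustration) and converts each draw to a peak infectious fraction using the standard SIR result, 1 - (1 + ln R)/R for R > 1.

```python
import math
import random

def peak_prevalence(r_eff):
    """Peak infectious fraction of a basic SIR epidemic seeded by a few cases."""
    if r_eff <= 1.0:
        return 0.0   # below threshold: no outbreak to speak of
    return 1.0 - (1.0 + math.log(r_eff)) / r_eff

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 1]."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]

random.seed(0)
peaks = []
for _ in range(10000):
    r0 = random.uniform(2.0, 3.5)             # hypothetical prior on uncontrolled R0
    effectiveness = random.uniform(0.2, 0.7)  # hypothetical prior on policy effectiveness
    peaks.append(peak_prevalence(r0 * (1.0 - effectiveness)))
peaks.sort()

for q in (0.025, 0.5, 0.975):
    print(q, round(percentile(peaks, q), 3))
```

Even this toy version spans everything from no outbreak to a large peak. None of that spread would show up in bounds derived from the goodness of fit alone.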

Coronavirus Roundup

I’ve been looking at early model-based projections for the coronavirus outbreak (SARS-CoV-2, COVID-19). The following post collects some things I’ve found informative. I’m eager to hear of new links in the comments.

Disease modelers gaze into their computers to see the future of Covid-19, and it isn’t good

The original SIR epidemic model, by Kermack and McKendrick. Very interesting to see how they thought about it in the pre-computer era, and how durable their analysis has been:

A data dashboard at Johns Hopkins:

A Lancet article that may give some hope for lower mortality:

The CDC’s flu forecasting activity:

Some literature, mostly “gray” preprints from MedRxiv, all open access:

A podcast with some background on transmission from Richard Larson, MIT (intestine alert – not for the squeamish!):

This blog post by Josh at Cassandra Capital collects quite a bit more interesting literature, and fits a simple SIR model to the data. I can’t vouch for the analysis because I haven’t looked into it in detail, but the links are definitely useful. One thing I note is that his fatality rate (12%) is much higher than in other sources I’ve seen (.5-3%) so hopefully things are less dire than shown here.

I had high hopes that social media might provide early links to breaking literature, but unfortunately the signal is swamped by rumors and conspiracy theories. The problem is made more difficult by naming – coronavirus, COVID19, SARS-CoV-2, etc. If you don’t include “mathematical model” or similar terms in your search, it’s really hopeless.

If you're interested in exploring this yourself, the samples in the standard Ventity distribution include a family of infection models. I plan to update some of these and report back.

Assessing the predictability of nonlinear dynamics

An interesting exploration of the limits of data-driven predictions in nonlinear dynamic problems:

Assessing the predictability of nonlinear dynamics under smooth parameter changes
Simone Cenci, Lucas P. Medeiros, George Sugihara and Serguei Saavedra
https://doi.org/10.1098/rsif.2019.0627

Short-term forecasts of nonlinear dynamics are important for risk-assessment studies and to inform sustainable decision-making for physical, biological and financial problems, among others. Generally, the accuracy of short-term forecasts depends upon two main factors: the capacity of learning algorithms to generalize well on unseen data and the intrinsic predictability of the dynamics. While generalization skills of learning algorithms can be assessed with well-established methods, estimating the predictability of the underlying nonlinear generating process from empirical time series remains a big challenge. Here, we show that, in changing environments, the predictability of nonlinear dynamics can be associated with the time-varying stability of the system with respect to smooth changes in model parameters, i.e. its local structural stability. Using synthetic data, we demonstrate that forecasts from locally structurally unstable states in smoothly changing environments can produce significantly large prediction errors, and we provide a systematic methodology to identify these states from data. Finally, we illustrate the practical applicability of our results using an empirical dataset. Overall, this study provides a framework to associate an uncertainty level with short-term forecasts made in smoothly changing environments.

Challenges Sourcing Parameters for Dynamic Models

A colleague recently pointed me to this survey:

Estimating the price elasticity of fuel demand with stated preferences derived from a situational approach

It starts with a review of a variety of studies:

Table 1. Price elasticities of fuel demand reported in the literature, by average year of observation.

This is similar to other meta-analyses and surveys I’ve seen in the past. That means using it directly is potentially problematic. In a model, you’d typically plug the elasticity into something like the following:

```
Indicated fuel demand
= reference fuel demand * (price/reference price) ^ elasticity
```

You’d probably have the expression above embedded in a larger structure, with energy requirements embodied in the capital stock, and various market-clearing feedback loops (as below). The problem is that plugging the elasticities from the literature into a dynamic model involves several treacherous leaps.

First, do the parameter values even make sense? Notice in the results above that 33% of the long term estimates have magnitude < .3, overlapping the top 25% of the short term estimates. That’s a big red flag. Do they have conflicting definitions of “short” and “long”? Are there systematic methods problems?

Second, are they robust as you plan to use them? Many of the short term estimates have magnitude <<.1, meaning that a modest supply shock would cause fuel expenditures to exceed GDP. This is primarily a problem with the equation above (but that’s likely similar to what was estimated). A better formulation would consider non-constant elasticity, but most likely the data is not informative about the extremes. One of the long term estimates is even positive – I’d be interested to see the rationale for that. Perhaps fuel is a luxury good?
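To make the expenditure problem concrete, here is a small sketch of the constant-elasticity equation and the price needed to clear the market after a supply cut. The 20% supply shock and the 3% share of GDP spent on fuel are hypothetical round numbers, not values from the survey.

```python
def indicated_demand(price, ref_price, ref_demand, elasticity):
    """Constant-elasticity demand, as in the equation above."""
    return ref_demand * (price / ref_price) ** elasticity

def clearing_price(supply, ref_price, ref_demand, elasticity):
    """Price at which constant-elasticity demand equals the available supply."""
    return ref_price * (supply / ref_demand) ** (1.0 / elasticity)

def expenditure_multiple(supply_fraction, elasticity):
    """Fuel spending after the shock, relative to spending before it."""
    price_ratio = clearing_price(supply_fraction, 1.0, 1.0, elasticity)
    return price_ratio * supply_fraction

fuel_share_of_gdp = 0.03   # hypothetical pre-shock share
for elasticity in (-0.5, -0.1, -0.05):
    multiple = expenditure_multiple(0.8, elasticity)
    print(elasticity, round(multiple, 2), round(multiple * fuel_share_of_gdp, 2))
```

With an elasticity of -0.05, clearing a 20% supply cut takes a price about 87 times the reference price. Spending rises roughly 70-fold, which is about twice GDP under the assumed 3% share. At -0.5 spending rises by only a quarter.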

Third, are the parameters any good? My guess is that some of these estimates are simply violating good practice for estimating dynamic systems. The real long term response involves a lot of lags on varying time scales, from annual (perceptions of prices and behavior change) to decadal (fleet turnover, moving, mode-switching) to longer (infrastructure and urban development). Almost certainly some of this is ignored in the estimate, meaning that the true magnitude of the long term response is understated.

Stated preference estimates avoid some problems, but create others. In the short run, people have a good sense of their options and responses. But in the long term, likely not: you’re essentially asking them to mentally simulate a complex system, evaluating options that may not even exist at present. Expert judgments are subject to some of the same limitations.

I think this explains why it’s possible to build a model that’s backed up with a lot of expertise and literature at every equation, that fails to reproduce the aggregate behavior of the system. Until you’ve spent time integrating components, reconciling conflicting definitions across domains, and revisiting open-loop estimates in a closed-loop context, you don’t have an internally consistent story. Getting to that is a big part of the value of dynamic modeling.

Not all models are wrong.

Box’s famous comment, that “all models are wrong,” gets repeated ad nauseam (even by me). I think it’s essential to be aware of this in the sloppy sciences, but it does a disservice to modeling and simulation in general.

As far as I’m concerned, a lot of models are basically right. I recently worked with some kids on an air track experiment in physics. We timed the acceleration of a sled released from various heights, and plotted the data. Then we used a quadratic fit, based on a simple dynamic model, to predict the next point. We were within a hundredth of a second, confirmed by video analysis.

Sure, we omitted lots of things, notably air resistance and relativity. But so what? There’s no useful sense in which the model was “wrong,” anywhere near the conditions of the experiment. (Not surprisingly, you can find a few cranks who contest Newton’s laws anyway.)

I think a lot of uncertain phenomena in social sciences operate on a backbone of the same kind of “physics.” The future behavior of the government is quite unpredictable, but there isn’t much uncertainty about accounting, e.g., that increasing the deficit increases the debt.

The domain of wrong but useful models remains large (within an even bigger sea of simple ignorance), but I think more and more things are falling into the category of models that are basically right. The trick is to be able to spot the difference. Some people clearly can’t:

A&G provide no formal method to distinguish between situations in which models yield useful or spurious forecasts. In an earlier paper, they claimed rather broadly,

‘To our knowledge, there is no empirical evidence to suggest that presenting opinions in mathematical terms rather than in words will contribute to forecast accuracy.’ (page 1002)

This statement may be true in some settings, but obviously not in general. There are many situations in which mathematical models have good predictive power and outperform informal judgments by a wide margin.

I wonder how well one could do with verbal predictions of a simple physical system? Score one for the models.

Prediction, in context

I’m increasingly running into machine learning approaches to prediction in health care. A common application is identification of risks for (expensive) infections or readmission. The basic idea is to treat patients like a function approximation problem.

The hospital compiles a big dataset on patient demographics, health status, exposure to procedures, and infection outcomes. A vendor slurps this up and turns some algorithm loose on the data, seeking the risk factors associated with the infection. It might look like this:

… except that there might be 200 predictors, not six – more than you can handle by eyeballing scatter plots or control charts. Once you have a risk model, you know which patients to target for mitigation, and maybe also which associated factors to pursue further.

However, this is only half the battle. Systems thinkers will recognize this model as a dead buffalo: a laundry list with unidirectional causality. The real situation is rich in feedback, including a lot of things that probably don’t get measured, and therefore don’t end up in the data for consideration by the algorithm. For example:

Infections aren’t just a random event for the patient; they happen for reasons that are larger than the patient. Even worse, there are positive feedbacks that can make prevention of infections, and errors more generally, hard to manage. For example, as the number of patients with infections rises, workload goes up, which creates time pressure and fatigue. That induces shortcuts and errors that create risk for patients, leading to more infections. Infections spread to other patients. Fatigued staff burn out and turn over faster, which dilutes the staff experience that might otherwise mitigate risk. (Experience, like many other dynamics, is not shown above.)

An algorithm that predicts risk in this context is certainly useful, because anything that reduces risk helps to diminish the gain of the vicious cycles. But it’s no longer so clear what to do with the patient assessments. Time spent on staff education and action for risk mitigation has to come from somewhere, and therefore might have unintended consequences that aren’t assessed by the algorithm. The algorithm is actually blind in two ways: it can’t respond to any input (like staff fatigue or skill) that isn’t in the data, and it probably isn’t statistically smart enough to deal with the separation of cause and effect in time and space that arises in a feedback system.

Deep learning systems like AlphaGo Zero might learn to deal with dynamics. But so far, high performance requires very large numbers of exemplars for reinforcement learning, and that’s never going to happen in a community hospital dataset. Then again, we humans aren’t too good at managing dynamic complexity either. But until the machines take over, we can build dynamic models to sort these problems out. By taking an endogenous point of view, we can put machine learning in context, refine our understanding of leverage points, and redesign systems for greater performance.

Footing the bill for Iraq

Back in 2002, when invasion of Iraq was on the table and many Democrats were rushing patriotically to the President’s side rather than thinking for themselves, William Nordhaus (staunchest critic of Limits) went out on a limb a bit to attempt a realistic estimate of the potential cost.

All the dangers that lead to ignoring or underestimating the costs of war can be reduced by a thoughtful public discussion. Yet neither the Bush administration nor the Congress – neither the proponents nor the critics of war – has presented a serious estimate of the costs of a war in Iraq. Neither citizens nor policymakers are able to make informed judgments about the realistic costs and benefits of a potential conflict when no estimate is given.

His worst case: about \$755 billion direct (military, peacekeeping and reconstruction) plus indirect effects totaling almost \$2 trillion for a decade of conflict and its aftermath.

Nordhaus’ worst case is pretty close to actual direct spending in Iraq to date. But with another trillion for Afghanistan and 2 to 4 trillion in the pipeline from future obligations related to the war, his grand total is looking like a lowball estimate. Other pre-invasion estimates, in the low billions, look downright ludicrous.

Recent news makes Nordhaus’ parting thought even more prescient:

Particularly worrisome are the casual promises of postwar democratization, reconstruction, and nation-building in Iraq. The cost of war may turn out to be low, but the cost of a successful peace looks very steep. If American taxpayers decline to pay the bills for ensuring the long-term health of Iraq, America would leave behind mountains of rubble and mobs of angry people. As the world learned from the Carthaginian peace that settled World War I, the cost of a botched peace may be even higher than the price of a bloody war.

Self-generated seasonal cycles

This time of year, systems thinkers should eschew sugar plum fairies and instead dream of Industrial Dynamics, Appendix N:

Self-generated Seasonal Cycles

Industrial policies adopted in recognition of seasonal sales patterns may often accentuate the very seasonality from which they arise. A seasonal forecast can lead to action that may cause fulfillment of the forecast. In closed-loop systems this is a likely possibility. The analysis of sales data in search of seasonality is fraught with many dangers. As discussed in Appendix F, random-noise disturbances contain a broad band of component frequencies. This means that any effort toward statistical isolation of a seasonal sales component will find some seasonality in the random disturbances. Should the seasonality so located lead to decisions that create actual seasonality, the process can become self-regenerative.

Self-induced seasonality appears to occur many places in American industry. Sometimes it is obvious and clearly recognized, and perhaps little can be done about it. An example of the obvious is the strong seasonality in items such as cameras sold in the Christmas trade. By bringing out new models and by advertising and other sales promotion in anticipation of Christmas purchases, the industry tends to concentrate its sales at this particular time of year.

Other kinds of seasonality are much less clear. Almost always when seasonality is expected, explanations can be found to justify whatever one believes to be true. A tradition can be established that a particular item sells better at a certain time of year. As this “fact” becomes more and more widely believed, it may tend to concentrate sales effort at the time when the customers are believed to wish to buy. This in turn still further heightens the sales at that particular time.
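The excerpt's warning that a search for seasonality will find some in pure noise is simple to demonstrate. In this sketch the sales series is synthetic, with a constant level and independent noise (the level, noise size, and ten-year span are arbitrary assumptions), so any seasonal pattern that turns up is spurious by construction.

```python
import random

random.seed(42)
years, months = 10, 12
# Hypothetical sales with NO true seasonality: constant level plus independent noise.
sales = [[100.0 + random.gauss(0.0, 10.0) for _ in range(months)]
         for _ in range(years)]

overall_mean = sum(sum(year) for year in sales) / (years * months)
# Naive seasonal index: each calendar month's average relative to the overall mean.
seasonal_index = [
    sum(sales[y][m] for y in range(years)) / years / overall_mean
    for m in range(months)
]

best = max(range(months), key=seasonal_index.__getitem__)
worst = min(range(months), key=seasonal_index.__getitem__)
print("apparent best month:", best + 1, round(seasonal_index[best], 3))
print("apparent worst month:", worst + 1, round(seasonal_index[worst], 3))
```

Some month always comes out as the apparent best and another as the apparent worst. If promotion and production are then scheduled around that index, the pattern can become real.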

Facebook has climbed out of its 2012 doldrums to a market cap of \$115 billion today. So, I’ve updated my user tracking and valuation model, just for kicks.

As in my last update, user growth continues to modestly exceed the original estimates. The user “carrying capacity” now is about 1.35 billion users, vs. .95 originally (K950 on graph) and 1.07 in 2012 – within the range of scenarios I originally ran, but well above the “best guess”. My guess is that the model will continue to underpredict for a while, because this is an inevitable pitfall of using a single diffusion process to represent what is surely the aggregate of several processes – stationary vs. mobile, different regions and demographics, etc. Of course, in the long run, users could also go down, which the basic logistic model can’t represent.

You can see what’s going on if you plot growth against users. The right tail doesn’t go to 0 as fast as the logistic assumes:

User growth probably isn’t a huge component of valuation, because these are modest differences on a percentage basis. Marginal users may be less valuable as well.
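For reference, the growth-against-users view has a simple form under a pure logistic: growth per user falls on a straight line that hits zero at the carrying capacity K. The sketch below fits that line to noise-free synthetic points. The rate and the sample points are invented for illustration, and this is not the tracking model described in this post.

```python
def logistic_growth(users, rate, capacity):
    """Net user growth per year under the logistic model: r * U * (1 - U / K)."""
    return rate * users * (1.0 - users / capacity)

def fit_capacity(users, growth):
    """Least-squares line through (U, growth / U); its x-intercept estimates K."""
    xs = users
    ys = [g / u for g, u in zip(growth, users)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return -intercept / slope, intercept   # (capacity K, fractional rate r)

# Synthetic illustration only: these are not actual Facebook data.
true_rate, true_capacity = 0.8, 1.35       # per year, billions of users
users = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
growth = [logistic_growth(u, true_rate, true_capacity) for u in users]

capacity, rate = fit_capacity(users, growth)
print(round(capacity, 3), round(rate, 3))
```

On real data, a right tail that declines more slowly than this line pulls the fitted intercept outward, so the estimated carrying capacity keeps creeping up as new points arrive.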

With revenue per user at a constant \$7/user/year, and 30% margins, and the current best-guess model, FB is now worth \$35 billion. What does it take to get to the ballpark of current market capitalization? Here’s one way:

• The carrying capacity ceiling for users continues to grow to 2 billion, and
• revenue per user rises to \$25/user/year

This preserves some optimistic base case assumptions,

• The risk-free interest rate takes 5 more years to rise substantially above 0 to a (still low) long term rate of 3%
• Margins stay at 30% as in 2009-2011 (vs. 18% y.t.d.)

Think it’ll happen?