Limits to Big Data

I’m skeptical of the idea that machine learning and big data will automatically lead to some kind of technological nirvana, a Star Trek future in which machines quickly learn all the physics needed for us to live happily ever after.

First, every other human technology has been a mixed bag, with improvements in welfare coming along with some collateral damage. It just seems naive to think that this one will be different.

But such side effects are not the primary problem.

Second, I think there are some good reasons to think that problems will get harder at the same rate that machines get smarter. The big successes I’ve seen are localized point prediction problems, not integrated systems with a lot of feedback. As soon as cause and effect are separated in time and space by complex mechanisms, you’re into sloppy systems territory, where data may constrain only a few parameters at a time. Making progress in such systems will increasingly require integration of multiple theories and data from multiple sources.
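To make the “sloppy” point concrete, here is a minimal sketch in Python with NumPy. The toy model (a sum of decaying exponentials), the rate values, and the function name are all my own illustrative assumptions, not drawn from any particular study. It computes how strongly a simulated data set constrains each combination of parameters.

```python
import numpy as np

def sensitivity_eigenvalues(rates, times):
    """Eigenvalues of J^T J for the toy model y(t) = sum_i exp(-k_i * t).

    J holds the sensitivities dy/d(log k_i) at each sample time.
    A large eigenvalue marks a parameter combination the data pin down
    well; a small one marks a combination the data barely constrain.
    """
    rates = np.asarray(rates, dtype=float)
    times = np.asarray(times, dtype=float)
    # dy/d(log k_i) = -k_i * t * exp(-k_i * t); one column per parameter
    J = -(rates * times[:, None]) * np.exp(-np.outer(times, rates))
    return np.linalg.eigvalsh(J.T @ J)  # returned in ascending order

# Hypothetical decay rates and sampling times, chosen only for illustration
times = np.linspace(0.1, 5.0, 50)
eigs = sensitivity_eigenvalues([0.5, 1.0, 2.0, 4.0], times)
spread = eigs[-1] / eigs[0]
print(eigs)
print(f"stiffest / sloppiest direction: {spread:.0f}x")
```

Even with fifty clean observations of a four-parameter model, the eigenvalues are spread over orders of magnitude, so only one or two parameter combinations are well determined. More data of the same kind mostly tightens the directions that were already tight.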

People in domains that have made heavy use of big data increasingly recognize this.

All data are wrong!

Simple descriptions of the Scientific Method typically run like this:

  • Collect data
  • Look for patterns
  • Form hypotheses
  • Gather more data
  • Weed out the hypotheses that don’t fit the data
  • Whatever survives is the truth

There’s obviously more to it than that, but every popular description I’ve seen leaves out one crucial aspect. Frequently, when the hypothesis doesn’t fit the data, it’s the data that’s wrong. This is not an invitation to cherry-pick your data; it’s just recognition of a basic problem, particularly in social and business systems.

Any time you are building an integrated systems model, it’s likely that you will have to rely on data from a variety of sources, with differences in granularity, time horizons, and interpretation. Those data streams have probably never been combined before, and therefore they haven’t been properly vetted. They’re almost certain to have problems. If you’re only looking for problems with your hypothesis, you’re at risk of throwing the good model baby out with the bad data bathwater.

The underlying insight is that data is not really distinct from models; it comes from processes that are full of implicit models. Even “simple” measurements like temperature are really complex and assumption-laden, but at least we can easily calibrate thermometers and agree on the definition and scale of Kelvin. This is not always the case for organizational data.

A winning approach, therefore, is to pursue every lead:

  • Is the model wrong?
    • Does it pass or fail extreme conditions tests, conservation laws, and other reality checks?
    • Exactly how does it fail to track the data, and is the miss systematic?
    • What feedbacks might explain the shortcomings?
  • Is the data wrong?
    • Do sources agree?
    • Does it mean what people think it means?
    • Are temporal patterns dynamically plausible?
  • If the model doesn’t fit the data, which is to blame?
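Two of the data questions above (“Do sources agree?” and “Are temporal patterns dynamically plausible?”) are easy to automate as a first pass. The following is a minimal Python sketch; the function names, the tolerances, and the monthly figures are all hypothetical and would need to be tuned to the system being modeled.

```python
def source_disagreements(a, b, rel_tol=0.05):
    """Keys present in both sources whose values differ by more than rel_tol."""
    flagged = []
    for key in sorted(a.keys() & b.keys()):
        scale = max(abs(a[key]), abs(b[key]))
        if scale > 0 and abs(a[key] - b[key]) / scale > rel_tol:
            flagged.append(key)
    return flagged

def implausible_jumps(series, max_rel_change=0.5):
    """Indices where a value changes by more than max_rel_change
    relative to the previous value (a crude plausibility test)."""
    flagged = []
    for i in range(1, len(series)):
        prev = series[i - 1]
        if prev != 0 and abs(series[i] - prev) / abs(prev) > max_rel_change:
            flagged.append(i)
    return flagged

# Made-up monthly figures from two hypothetical sources
ledger = {"2020-01": 100.0, "2020-02": 104.0, "2020-03": 108.0}
survey = {"2020-01": 101.0, "2020-02": 130.0, "2020-03": 109.0}

print(source_disagreements(ledger, survey))             # ['2020-02']
print(implausible_jumps([100.0, 104.0, 10.4, 108.0]))   # [2, 3]
```

A flag from either check is a lead to pursue, not a verdict. The second series might contain a misplaced decimal point, or it might record a real shock that the model ought to explain.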

When you’re building a systems model, it’s likely that you’re a pioneer in uncharted territory, and therefore you’ll learn something new and valuable either way.