Bad data, bad models

Baseline Scenario has a nice post on bad data:

To make a vast generalization, we live in a society where quantitative data are becoming more and more important. Some of this is because of the vast increase in the availability of data, which is itself largely due to computers. Some is because of the vast increase in the capacity to process data, which is also largely due to computers. …

But this comes with a problem. The problem is that we do not currently collect and scrub good enough data to support this recent fascination with numbers, and on top of that our brains are not wired to understand data. And if you have a lot riding on bad data that is poorly understood, then people will distort the data or find other ways to game the system to their advantage.

In spite of ubiquitous enterprise computing, bad data is the norm in my experience with corporate consulting. At one company, I had access to very extensive data on product pricing, promotion, advertising, placement, etc., but the information system archived everything inaccessibly on a rolling 3-year horizon. That made it impossible to see the long-term dynamics of brand equity, which was really the most fundamental driver of the firm’s success. Our experience with large projects includes instances where managers don’t want to know the true state of the system, and therefore refuse to collect or provide needed data – even when billions are at stake. And some firms jealously guard data within stovepipes – it’s hard to optimize the system when the finance group keeps the true product revenue stream secret in order to retain leverage over the marketing group.

People worry about garbage in, garbage out, but modeling can actually be the antidote to bad data. If you pay attention to quality, the process of building a model will reveal all kinds of gaps in data. We recently discovered that various sources of vehicle fleet data are in serious disagreement, because of double-counting of transactions and interstate sales, and undercounting of inspections. Once data issues are known, a model can be used to remove biases and filter noise (your GPS probably runs a Kalman filter to combine a simple physical model of your trajectory with noisy satellite measurements).
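To make the Kalman filter idea concrete, here is a minimal sketch in Python. It is not how any particular GPS receiver works; it assumes a one-dimensional trajectory with a known constant velocity, and the function name `kalman_1d`, the noise variances, and the simulated data are all made up for illustration.

```python
import random


def kalman_1d(measurements, velocity, dt, process_var, meas_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a position that moves at a known velocity.

    All parameter names and values are illustrative assumptions.
    """
    x, p = x0, p0  # state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: advance the estimate with the simple physical model.
        x = x + velocity * dt
        p = p + process_var
        # Update: blend the prediction with the noisy measurement.
        k = p / (p + meas_var)  # Kalman gain, between 0 and 1
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return (sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)) ** 0.5


# Simulated data: true position plus Gaussian measurement noise (sd = 5).
random.seed(0)
dt, velocity = 1.0, 2.0
truth = [velocity * dt * (i + 1) for i in range(50)]
noisy = [t + random.gauss(0.0, 5.0) for t in truth]

filtered = kalman_1d(noisy, velocity, dt,
                     process_var=0.01, meas_var=25.0, x0=0.0, p0=100.0)

print("raw measurement RMSE:", rmse(noisy, truth))
print("filtered RMSE:       ", rmse(filtered, truth))
```

The gain `k` is the whole trick: when the measurement variance is large relative to the model's uncertainty, the filter leans on the physical model, and when the model is uncertain it leans on the data.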

Not just any model will do; causal models are important. It’s hard to discover that your data fails to observe physical laws or other reality checks with a model that permits negative cows and buries the acceleration of gravity in a regression coefficient.
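As a rough illustration of the negative cows problem, the sketch below compares a tiny stock-and-flow herd model with a straight-line trend. Everything here is hypothetical: the function names, the birth and cull rates, and the starting herd are invented for the example, and the trend's slope is simply set to match the herd's initial net rate of change.

```python
def simulate_herd(initial_cows, birth_rate, cull_rate, years, dt=0.25):
    """Causal sketch: births and culls are both proportional to the herd.

    Because the outflow depends on the stock, no cows means no culling,
    and the herd cannot go negative as long as cull_rate * dt < 1.
    """
    cows = initial_cows
    path = [cows]
    for _ in range(int(years / dt)):
        births = birth_rate * cows
        culls = cull_rate * cows
        cows = cows + dt * (births - culls)  # Euler integration step
        path.append(cows)
    return path


def linear_trend(initial_cows, slope_per_year, years, dt=0.25):
    """Non-causal sketch: extrapolate a fixed slope, with no floor at zero."""
    steps = int(years / dt)
    return [initial_cows + slope_per_year * dt * i for i in range(steps + 1)]


# Hypothetical numbers: 1000 cows, 10%/year births, 40%/year culls.
herd = simulate_herd(1000.0, birth_rate=0.1, cull_rate=0.4, years=20)
# Slope chosen to match the herd's initial net change: (0.1 - 0.4) * 1000.
trend = linear_trend(1000.0, slope_per_year=-300.0, years=20)

print("smallest herd in causal model:", min(herd))
print("smallest herd in linear trend:", min(trend))
```

Both models agree at the start, but only the one with physical structure stays in the realm of the possible, which is what makes it useful for spotting data that does not.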

The problem is, a lot of people have developed an immune response against models, because there are so many that don’t pay attention to quality and serve primarily propagandistic purposes. The only antidote for that, I think, is to teach modeling skills, or at least model consumption skills, so that people know the right questions to ask in order to separate the babies from the bathwater.
