Limits to Big Data

I’m skeptical of the idea that machine learning and big data will automatically lead to some kind of technological nirvana, a Star Trek future in which machines quickly learn all the physics needed for us to live happily ever after.

First, every other human technology has been a mixed bag, with improvements in welfare coming along with some collateral damage. It just seems naive to think that this one will be different.

But these concerns are not the primary problem.

Second, I think there are good reasons to expect problems to get harder at the same rate that machines get smarter. The big successes I’ve seen are localized point-prediction problems, not integrated systems with a lot of feedback. As soon as cause and effect are separated in time and space by complex mechanisms, you’re into sloppy-systems territory, where the data may constrain only a few parameters at a time. Making progress in such systems will increasingly require integration of multiple theories and data from multiple sources.
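To make the sloppiness point concrete, here is a minimal sketch (a made-up two-parameter decay model, not taken from any of the papers below): the observations pin down only the sum of the two rate constants, so very different parameter combinations fit equally well, and more data of the same kind will never separate them.

```python
# Toy "sloppy system": a hypothetical model y = exp(-(k1 + k2) * t).
# The data constrain the sum k1 + k2, but not k1 and k2 individually.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 5, 50)
y_obs = np.exp(-1.0 * t) + rng.normal(0, 0.01, t.size)  # true k1 + k2 = 1.0

def sse(k1, k2):
    """Sum of squared errors of the model against the observed series."""
    return np.sum((np.exp(-(k1 + k2) * t) - y_obs) ** 2)

# Very different parameter pairs with the same sum fit equally well...
for k1, k2 in [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]:
    print(f"k1={k1}, k2={k2}, SSE={sse(k1, k2):.4f}")

# ...while moving off the k1 + k2 = 1 ridge degrades the fit sharply.
print(f"k1=0.5, k2=0.7, SSE={sse(0.5, 0.7):.4f}")
```

Identifying k1 and k2 separately requires something outside this data stream: an additional measurement point inside the system, or a theory that ties one of the rates to something already known.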

People in domains that have made heavy use of big data increasingly recognize this:

Big data need big theory too

Abstract

The current interest in big data, machine learning and data analytics has generated the widespread impression that such methods are capable of solving most problems without the need for conventional scientific methods of inquiry. Interest in these methods is intensifying, accelerated by the ease with which digitized data can be acquired in virtually all fields of endeavour, from science, healthcare and cybersecurity to economics, social sciences and the humanities. In multiscale modelling, machine learning appears to provide a shortcut to reveal correlations of arbitrary complexity between processes at the atomic, molecular, meso- and macroscales. Here, we point out the weaknesses of pure big data approaches with particular focus on biology and medicine, which fail to provide conceptual accounts for the processes to which they are applied. No matter their ‘depth’ and the sophistication of data-driven methods, such as artificial neural nets, in the end they merely fit curves to existing data. Not only do these methods invariably require far larger quantities of data than anticipated by big data aficionados in order to produce statistically reliable results, but they can also fail in circumstances beyond the range of the data used to train them because they are not designed to model the structural characteristics of the underlying system. We argue that it is vital to use theory as a guide to experimental design for maximal efficiency of data collection and to produce reliable predictive models and conceptual knowledge. Rather than continuing to fund, pursue and promote ‘blind’ big data projects with massive budgets, we call for more funding to be allocated to the elucidation of the multiscale and stochastic processes controlling the behaviour of complex systems, including those of life, medicine and healthcare.

(Previous and next item via Mark Buchanan writing in Nature Physics.)
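The extrapolation failure the abstract describes is easy to reproduce with a toy example (mine, not the authors'): fit a flexible curve to data from a saturating process over a limited range, and it does fine in-sample while going badly wrong outside that range.

```python
# A flexible curve fit to a saturating process: good inside the training
# range, unreliable outside it, because the fitted form knows nothing about
# the structure (here, saturation) of the underlying system.
import numpy as np

rng = np.random.default_rng(1)

def y_true(x):
    """Assumed underlying saturating process (illustrative only)."""
    return 1 - np.exp(-x)

x_train = np.linspace(0, 3, 40)
y_train = y_true(x_train) + rng.normal(0, 0.02, x_train.size)

fit = np.poly1d(np.polyfit(x_train, y_train, deg=6))

print("max in-range error:", np.max(np.abs(fit(x_train) - y_true(x_train))))

x_new = np.array([4.0, 5.0, 6.0])  # beyond the observed range
print("true values:   ", y_true(x_new))
print("extrapolations:", fit(x_new))
```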

Hosni and Vulpiani look at this formally in physics:

Forecasting in Light of Big Data

Abstract

Predicting the future state of a system has always been a natural motivation for science and practical applications. Such a topic, beyond its obvious technical and societal relevance, is also interesting from a conceptual point of view. This owes to the fact that forecasting lends itself to two equally radical, yet opposite methodologies. A reductionist one, based on the first principles, and the naïve-inductivist one, based only on data. This latter view has recently gained some attention in response to the availability of unprecedented amounts of data and increasingly sophisticated algorithmic analytic techniques. The purpose of this note is to assess critically the role of big data in reshaping the key aspects of forecasting and in particular the claim that bigger data leads to better predictions. Drawing on the representative example of weather forecasts we argue that this is not generally the case. We conclude by suggesting that a clever and context-dependent compromise between modelling and quantitative analysis stands out as the best forecasting strategy, as anticipated nearly a century ago by Richardson and von Neumann.
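One way to see why bigger data doesn't straightforwardly buy better forecasts is chaos: small errors in the current state grow exponentially, so a purely data-driven forecast needs a vastly larger archive for each additional step of useful lead time. Here is a minimal sketch (a toy logistic map standing in for the atmosphere, not anything from the paper itself) of the classic method of analogues: forecast by finding the closest past state in the archive and replaying what followed it.

```python
# Method-of-analogues forecasting on a chaotic toy system.
import numpy as np

def logistic(x):
    """One step of the chaotic logistic map with r = 4."""
    return 4 * x * (1 - x)

def series(n, x0):
    """Generate n iterates of the map starting from x0."""
    out = np.empty(n)
    out[0] = x0
    for i in range(1, n):
        out[i] = logistic(out[i - 1])
    return out

history = series(100_000, 0.37)   # the "big data" archive of past states
test = series(200, 0.611)         # a new trajectory we want to forecast

horizon = 10
errors = np.zeros(horizon)
n_forecasts = len(test) - horizon
for start in range(n_forecasts):
    # Analogue forecast: find the most similar past state, reuse its future.
    idx = np.argmin(np.abs(history[:-horizon] - test[start]))
    for h in range(1, horizon + 1):
        errors[h - 1] += abs(history[idx + h] - test[start + h])
errors /= n_forecasts

for h, err in enumerate(errors, 1):
    print(f"lead {h}: mean abs error {err:.4f}")
```

Running this shows the error climbing by roughly a factor of two per step, which is just the map's Lyapunov exponent at work; pushing the useful horizon out by one more step requires roughly doubling the archive, not adding a few points.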

An even stronger result shows that learnability itself can be undecidable, independent of the standard axioms of mathematics in the same way that Gödel and Cohen showed the continuum hypothesis to be:

Learnability can be undecidable

Abstract

The mathematical foundations of machine learning play a key role in the development of the field. They improve our understanding and provide tools for designing new learning paradigms. The advantages of mathematics, however, sometimes come with a cost. Gödel and Cohen showed, in a nutshell, that not everything is provable. Here we show that machine learning shares this fate. We describe simple scenarios where learnability cannot be proved nor refuted using the standard axioms of mathematics. Our proof is based on the fact that the continuum hypothesis cannot be proved nor refuted. We show that, in some cases, a solution to the ‘estimating the maximum’ problem is equivalent to the continuum hypothesis. The main idea is to prove an equivalence between learnability and compression.

I think the bottom line is that we’re likely to benefit from AI in the same way we have benefitted from other technologies: it will help us to the extent that we manage it wisely. To the extent that we don’t, it’ll just be a more efficient way to recreate all the frailties of human cognition.
