Making Hay

If you read this blog (and, well, you do), you probably weren’t surprised by Nassim Taleb’s recent claim that rapidly increasing amounts of data can give rise to rapidly increasing amounts of bad analysis.

Taleb’s observation leans more to the mathematical than the behavioral.  In brief: greater numbers of variables per observation combined with greater numbers of observations give rise to more false correlations.  As much as we love the graph showing the nonlinearity of spurious correlations (heck, we like just saying “the nonlinearity of spurious correlations”), the point is far from novel.
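
For anyone who would rather watch the hay pile up than take it on faith, here is a minimal sketch of the effect. It is our own illustration, not Taleb's calculation: every variable is independent Gaussian noise, so any correlation that clears the bar is spurious by construction. The function names, the 30-observation sample, and the 0.36 cutoff (roughly the two-sided 5% critical value of r at that sample size) are all assumptions chosen for the demonstration.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pair_count(n_vars):
    """Number of distinct variable pairs: n(n-1)/2, quadratic in n."""
    return n_vars * (n_vars - 1) // 2


def count_spurious(n_vars, n_obs=30, threshold=0.36, seed=0):
    """Count pairs of pure-noise variables whose |r| exceeds the threshold.

    All columns are independent by construction, so every hit is a false one.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )


# Doubling the variables roughly quadruples the pairs available to mislead us.
for k in (5, 10, 20, 40):
    print(k, "variables:", pair_count(k), "pairs,",
          count_spurious(k), "spurious correlations")
```

The number of candidate pairs grows with the square of the number of variables, and at a fixed cutoff the expected number of false hits grows right along with it.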

More interesting is the implication that this is a problem.  Why should an increase in spurious correlations be an issue for companies working with such data?  If, in Taleb’s analogy, “the problem is that the needle comes in an increasingly larger haystack,” why is the increase in the hay-to-needle ratio anything other than a quantitative challenge – precisely the kind of problem that should pose no concern at all for the kinds of server farms giving rise to the correlations in the first place?

The answer stems from a methodological shift that has paralleled the increases in processing power and data availability.  One way to frame the issue would be to say a lot of “analysts” are confusing the necessary with the sufficient.  Or we could suggest they are confusing correlation and causation the way many people do.  But the most direct way to put it is to note that a lot of software-enabled “analysts” are just plain lazy.

Morphing data into information requires hands-on work and detailed knowledge of the data source to avoid GIGO pitfalls.  Shaping information into insight involves generating hypotheses that reflect an understanding of the data’s context and dynamics.  Filtering insight into knowledge demands testing holdout samples to confirm the hypotheses.  And turning knowledge into wisdom means repeating these steps forever.
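
The holdout step is the one most often skipped, so here is a hedged sketch of what it looks like in miniature. Again this is our own toy setup, not anyone's production method: the target and all 200 candidate predictors are independent noise, and the function name, sample sizes, and even split are assumptions. We go fishing for the best correlation in one half of the data, then ask the untouched half whether it agrees.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_then_confirm(n_candidates=200, n_obs=40, seed=1):
    """Pick the best-looking predictor on one half, re-test it on the other.

    Returns (index of chosen candidate, its exploration-half r, its holdout r).
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    candidates = [
        [rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)
    ]
    half = n_obs // 2

    # Exploration: search the first half for the strongest correlation.
    explore_r = [pearson(c[:half], target[:half]) for c in candidates]
    best = max(range(n_candidates), key=lambda i: abs(explore_r[i]))

    # Confirmation: test that single hypothesis on data it has never seen.
    holdout_r = pearson(candidates[best][half:], target[half:])
    return best, explore_r[best], holdout_r


best, explore_r, holdout_r = select_then_confirm()
print("chosen candidate:", best)
print("r on exploration half:", round(explore_r, 3))
print("r on holdout half:", round(holdout_r, 3))
```

Because the winner was chosen for being extreme, its exploration-half correlation is inflated by the search itself. The holdout correlation carries no such selection bias, which is why it is the number worth believing.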

Correlation is a fine place to start analysis.  And it’s an important place to conclude it.  But there are many steps on the journey other than getting a correlation from Excel (or SAS or R).  If the rise in spurious correlations is a problem for companies working with such data, they probably just aren’t working very hard.