This week, FiO returns to the issue of Google searches and the stock market, by looking at correlation via sunspots and Goldilocks.

In a previous Figure It Out (Google Zipfed) we looked at Zipf’s law: if you rank the words in a long piece of text by how often each occurs, then a word’s frequency is inversely proportional to its rank in the frequency table. So the most frequently occurring word is twice as common as the second-ranked word, which is in turn twice as common as the fourth-ranked word. We tested this proposition for Google searches, and found that in the case of the FTSE100 companies, there was indeed a strong relationship between the number of Google search results for each company and the rank of those results.

We speculated that perhaps there is also a correlation between the number of search results and a company’s market capitalisation. To test this, we took the FTSE40 market capitalisation (as listed in Wikipedia at http://en.wikipedia.org/wiki/FTSE_100_Index, for 3rd October 2009) and plotted this against the number of Google search results for each of the companies:
(Source: Wikipedia and Capgemini analysis of Google results)

What this shows is that there is little linear correlation between the FTSE40’s market capitalisation and the number of search results returned by a given search engine, even if we exclude the outlier (BP – surprise, surprise). Two things arise from this: firstly, did we ask the right question, and secondly, do the data support the analysis?

On the first point, we merely raised a hypothesis that there is a linear correlation between two sets of data. But even if we had found a correlation (linear or otherwise), what would we have learnt? As every scientist knows, correlation does not imply causation: just because there is a correlation between X and Y, this does not mean that X causes Y. In fact, the question of whether X causes Y can get very convoluted.

For example, in the 19th century, a link was proposed between agricultural output and sunspots. The logic was that natural fluctuations in sunspot activity affect the weather, which leads to cyclical variation in agricultural output and therefore GDP. Since then, the hypothesis has attracted both supporters and critics, and has even been extended to other parts of the solar system (Henry Ludwell Moore proposed a causation between Venus and economic cycles – sadly, he never wrote his masterwork “Men are from Earth and Economists are from Venus”). In one of the more recent papers on the issue, a very strong correlation was observed between the number of sunspots and both US GDP and the Dow Jones Industrial Average (DJIA), over a period of 80 years (Modis, 2007 – “Sunspots, GDP and the stock market”, Technological Forecasting and Social Change, 74(8): 1508-1514).
Even if this paper doesn’t settle the question of solar activity and economic output once and for all, it contains the now eerie observations that “the present upward excursions of the DJIA and GDP should continue until June 2008…From mid-2008 onward both the stock market and the GDP should move downward toward their long range trends”. In this case, even if there is no obvious causation, the mere existence of a correlation allowed some useful forecasts to be made. One of the most interesting current arguments in the correlation vs. causation debate concerns climate change; although there are well-defined methods for testing whether a given hypothesis is true, we’ll leave that analysis to the experts (see previous FiO, Warmer Days Ahead?).

On the second point – do the data support the analysis – one of the complexities of correlation (in this case, linear correlation) is that the data have to be “just right” – not too hot, and not too cold (Goldilocks). As an example, there’s no point looking for a linear correlation if the data show heteroscedasticity – that is, they look like this:
In the above example, the data “fan out” at larger X values, which means that the strength of linear correlations may be overestimated. There are many other types of dataset where it would be foolish to look for a linear correlation, and others where there is no correlation at all.

The upshot of both the correlation vs. causation debate and the Goldilocks approach is that whilst correlations are easy to create in Excel, they need to be treated with care. Indeed, as OR practitioners, we try to resolve both questions (are we asking the right question, and do the data support the analysis?) so that any conclusions we come up with have a strong foundation in both the data and the techniques used to analyse them.
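For readers who would rather check these cautions outside Excel, here is a minimal Python sketch of the two data problems mentioned above: a single outlier distorting a linear correlation, and data that fan out at larger X values. Everything in it is an assumption made for illustration. The numbers are synthetic (they are not the FTSE or Google figures), the function names `pearson_r` and `residual_spread_ratio` are hypothetical, and the spread ratio is only a crude eyeball check, not one of the formal statistical tests for heteroscedasticity.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residual_spread_ratio(xs, ys):
    """Crude fan-out check (illustrative only, not a formal test).

    Fits a least-squares line, then divides the root-mean-square residual
    in the upper half of the X range by that in the lower half. A ratio
    well above 1 suggests the scatter widens as X grows.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Sort the points by X so the residuals can be split into two halves.
    pairs = sorted(zip(xs, ys))
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    half = n // 2

    def rms(values):
        return math.sqrt(sum(v * v for v in values) / len(values))

    return rms(residuals[half:]) / rms(residuals[:half])


if __name__ == "__main__":
    # 1. One outlier can manufacture a strong "correlation".
    xs = [1, 2, 3, 4, 5, 6]
    ys = [2, 1, 2, 1, 2, 1]          # made-up values with no real trend
    print("r without outlier:", round(pearson_r(xs, ys), 3))
    print("r with outlier:   ", round(pearson_r(xs + [50], ys + [20]), 3))

    # 2. Even scatter versus scatter that fans out at larger X.
    rng = random.Random(0)           # fixed seed so the run is repeatable
    x = [float(i) for i in range(1, 201)]
    even = [2 * v + rng.gauss(0, 20) for v in x]         # constant noise
    fanned = [2 * v + rng.gauss(0, 0.5 * v) for v in x]  # noise grows with X
    print("spread ratio, even scatter:  ", round(residual_spread_ratio(x, even), 2))
    print("spread ratio, fanned scatter:", round(residual_spread_ratio(x, fanned), 2))
```

The first pair of numbers shows why it is worth recomputing a correlation with and without a suspect point, as we did with BP. The second pair gives a rough signal that the data are not “just right”: a ratio near 1 is what even scatter should produce, while a ratio well above 1 points to fanning out.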