This week, FiO extends its reach from maths to languages and beyond, by looking at Zipf’s Law, Google searches, and avoiding RSI.

Call me Ishmael

George Kingsley (GK) Zipf looked at long texts and noticed that if you rank each word by how often it appears, that frequency is inversely proportional to the word’s rank in the frequency table. Stated mathematically, the law says the word of rank r appears with a frequency proportional to 1/r. In practice, this means that words appearing once, 10 times, 100 times or 1,000 times are all roughly equally usual – and a common word typically appears more frequently than an uncommon one by a factor of 10, 100 or more. As an example, this is the result for Moby Dick:
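To make the rank–frequency idea concrete, here is a minimal sketch (not the original FiO analysis) that counts words in a toy corpus and prints frequency times rank, which Zipf’s Law predicts should be roughly constant for a long enough text:

```python
from collections import Counter

# Toy corpus: the opening of Moby Dick, lower-cased and stripped of punctuation.
text = ("call me ishmael some years ago never mind how long precisely "
        "having little or no money in my purse and nothing particular "
        "to interest me on shore i thought i would sail about a little "
        "and see the watery part of the world")
counts = Counter(text.split())

# Rank words by descending frequency (rank 1 = most frequent) and print
# freq * rank, the quantity Zipf's Law predicts is roughly constant.
for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    print(f"{rank:>2}  {word:<10} freq={freq}  freq*rank={freq * rank}")
```

A single paragraph is of course far too short for the law to hold; on a full text like Moby Dick, plotting log(rank) against log(frequency) gives the straight line shown below.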
(Source: Words counted from the Project Gutenberg text of Moby Dick by Wikipedia user Radagast3)

As an aside (for all you pub quiz fans and party bores out there), the red data points are words that appear only once in “Moby Dick” – and the term for a word appearing only once in a body of text (a language, an author’s work, a single book, etc.) is hapax legomenon. For example, in the whole of Shakespeare, the word Honorificabilitudinitatibus appears only once (luckily for us and Will’s Hungarian and Finnish translators).

What happens when I google “Google”…?

So what does all this have to do with Google? At FiO, we wondered whether Zipf’s Law holds in the Internet age – and not with ordinary English words, but with proper nouns. To test this, we took the names of the FTSE100 companies (from http://www.stockmarketsreview.com/companies_ftse100/), and googled their names verbatim, without quote marks (e.g. Daily Mail and General Trust). We left out two companies, as their names were normal English words (Next and Resolution). We then ranked the number of search results from 1 to 98, and plotted the logarithms of both the rank (1 to 98) and frequency (i.e. the number of results returned by Google):
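Finding the hapax legomena in a text is a one-liner once the words are counted. A small illustrative sketch (the sample sentence is made up for the example):

```python
from collections import Counter

# Made-up sample text for illustration.
words = ("in the whole text some words appear only once and some words "
         "appear more than once in the text").split()
counts = Counter(words)

# Hapax legomena: words whose count is exactly 1.
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(hapaxes)  # ['and', 'more', 'only', 'than', 'whole']
```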
(Source: Capgemini analysis of Google results)

These results show that Zipf’s Law does, in the main, hold in the Internet age… but wait (I hear you ask) – didn’t it take a while searching for 98 names, and ranking the search results? Why yes, it did – so isn’t there a quicker way? The 80-20 (or Pareto) rule says, roughly, that 80% of sales come from 20% of customers. If we were interested in the total number of search results returned by the FTSE100, could we have looked at only the top 20% – i.e. the top 20 or so names ranked by the number of search results? The answer is “yes”: the top 20% accounted for a whopping 89% of the total number of search results.

So what has this diversion into language told us? Firstly, that FiO clearly needs to take a summer holiday, reading trashy Dan Brown books with a G&T at hand. Secondly, that some of the theories developed for the printed (p)age can be applied to the era of the Internet. Indeed, Zipf’s Law is applicable to all kinds of things – e.g. ranking cities by population – so it’s not surprising that it can be applied to search results (and there might be a correlation between the number of search results and a company’s market cap – but that’s for a future FiO). Thirdly, faced with large amounts of data, it’s often better to avoid “boiling the ocean” and concentrate on only the top 20% or so of data. This saves both time and money, and avoids developing Repetitive Strain Injury. And lastly, having the highest number of Internet search results isn’t always a positive thing: unfortunately the FTSE100 company with the #1 rank was BP.
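The top-20% calculation is easy to reproduce. As a sketch, assume 98 hypothetical result counts that follow an ideal Zipf curve, f(r) = C/r (C is an arbitrary constant), and compute the share held by the top 20% of names:

```python
# Hypothetical result counts for 98 names on an ideal Zipf curve f(r) = C / r.
C = 1_000_000  # arbitrary scale constant
results = [C / r for r in range(1, 99)]

total = sum(results)
top_n = round(0.2 * len(results))  # top 20% of the 98 names
top_share = sum(sorted(results, reverse=True)[:top_n]) / total
print(f"Top 20% of names account for {top_share:.0%} of all results")
```

On a perfect Zipf curve the top 20% holds only about 70% of the total, so the 89% observed for the FTSE100 names suggests the real distribution is somewhat steeper than the idealised one.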