The base rate fallacy – what is it, and why does it matter?

The base rate fallacy is common in analytics, and especially in fraud analytics; it often arises in Supervised Machine Learning when the target variable is imbalanced.

First of all, a trigger warning: this post makes reference to COVID-19 in its illustration of the base rate fallacy. Secondly, a disclaimer: the example is just an illustration, and all numbers involved are deliberately contrived only for expositional purposes.

Suppose a test for some feature of interest (say, having COVID-19 in the UK, in August 2020) has 95% accuracy, in that 95% of those with that feature who take the test have a positive test result (or ‘positive classification’), and 95% of those without that feature who take the test have a negative test result (or ‘negative classification’). Suppose further, however, that the feature is actually very uncommon: say, only 1 in 150 people have COVID-19. Suppose that you’re tested, and the test comes back positive. What’s the actual probability that you have COVID-19 in such a situation?

Take a moment. Even if you’re not sure, write your best guess down.

The answer is actually only about 11%. If you concluded anything above that, you’d be committing the base rate fallacy. As a minimal definition, we can say that we commit the base rate fallacy when we make a statistical inference that in some sense ‘ignores’ the base rate (or ‘prior probability’) of the feature of interest. If the verbal explanation to follow doesn’t resonate with you, stay with me; the diagram below should clarify things.

Often, we commit the base rate fallacy by conflating the proportion of positive cases that are positively classified, on the one hand, with the proportion of positively classified cases that are positive, on the other. In situations where our test is highly accurate (for example, >95%) but the prior probability, or base rate, of the feature of interest is low (say, just 1/150), the former (the proportion of positive cases that are positively classified) will be high: 95% in our case, simply because of the accuracy of the test. The latter (the proportion of positively classified cases that are positive), however, may well be very low, because the vast majority of the positively classified cases are actually false positives. With such a low base rate, almost everyone taking the test doesn’t have the feature of interest, and even a 5% error rate among that huge group produces far more false positives than there are true positives.
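To make this concrete, here is a minimal sketch of the calculation in Python, using Bayes’ theorem with the contrived figures above (the variable names are purely illustrative):

# A minimal sketch of the posterior probability, via Bayes' theorem.
base_rate = 1 / 150      # P(has COVID-19): the prior, or base rate
sensitivity = 0.95       # P(tests positive | has COVID-19)
specificity = 0.95       # P(tests negative | does not have COVID-19)

# Total probability of testing positive: true positives plus false positives
p_positive = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)

# Bayes' theorem: P(has COVID-19 | tests positive)
posterior = sensitivity * base_rate / p_positive
print(f"{posterior:.1%}")  # prints 11.3%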

Suppose for a moment that the population of the UK is exactly 66,000,000. Let’s visualise this:

 

 

[Figure: the base rate fallacy visualised as a population breakdown of test results. Image source: Capgemini]

Since the base rate of having COVID-19 is just 1/150, of 66,000,000 people, 440,000 have it and 65,560,000 don’t. Of those 440,000, 418,000 test positive (because the accuracy of the test is 95%, and 95% of 440,000 is 418,000). The remaining 22,000 are false negatives; that is, individuals who really have COVID-19, but who tested negative.

Of the 65,560,000 who don’t have COVID-19, 62,282,000 test negative (again, due to the accuracy of the test; 95% of 65,560,000 is 62,282,000). But the remaining 3,278,000 don’t have COVID-19, and yet tested positive; these are the false positive cases, and they form the vast majority (3,278,000/(418,000 + 3,278,000) ≈ 89%) of the positively classified cases.

Hence, the proportion of positively classified cases that are positive is just 418,000/3,696,000 ≈ 11%, whereas the proportion of positive cases that are positively classified is 418,000/440,000 = 95%.[1]
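The same arithmetic, written out as a short Python sketch over raw counts (again, all figures are the contrived ones above):

# Population breakdown under the contrived figures above.
population = 66_000_000
base_rate = 1 / 150
accuracy = 0.95                                # both sensitivity and specificity here

positives = round(population * base_rate)      # 440,000 have COVID-19
negatives = population - positives             # 65,560,000 do not

true_positives = round(positives * accuracy)   # 418,000 have it and test positive
false_negatives = positives - true_positives   # 22,000 have it but test negative
true_negatives = round(negatives * accuracy)   # 62,282,000 don't have it, test negative
false_positives = negatives - true_negatives   # 3,278,000 don't have it, test positive

flagged = true_positives + false_positives     # 3,696,000 positively classified
print(true_positives / flagged)                # ~0.11: flagged cases that are positive
print(true_positives / positives)              # 0.95: positive cases that are flagged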

The base rate fallacy, as you might imagine, is extremely common in statistics and can trip us up, as individuals and as members of organisations, in a whole host of contexts. We may justify important decisions with reasoning that commits the base rate fallacy, and those decisions might be drastically mistaken: we might be better off doing nothing, or doing something else entirely. If we were to receive a positive classification from a COVID-19 test in a situation like the one described above, we might be moved to panic, when the numbers suggest that panic isn’t necessary. There are a multitude of other situations in which committing the base rate fallacy could lead us, and the organisations of which we’re members, down potentially expensive or otherwise harmful paths:

  • A new commercial aeroplane fails a safety check; the test is highly (but not wholly) accurate, and the base rate of unsafe new aeroplanes is very low. If the aeroplane is hastily modified without careful diagnostics, we could damage an expensive (and possibly perfectly good) machine;
  • An individual is flagged by a government system as being a threat to national security; the test is very accurate, but the base rate of being such a threat is extremely low. If the individual is innocent, but is subjected to Kafka-esque surveillance and trial just on the basis of the relevant flagging, then we may well have acted unjustly and unlawfully;
  • An email is detected as malware by a filter; the test is highly accurate, but the base rate of malware is very low. If we are working for a company developing and selling such a filter, we need to be aware of such facts on pain of poor sales.

Hence, it’s very important to be wary of our tendency to fall into such traps when we make statistical inferences. Keep an eye out for the fallacy in your own reasoning, as well as in anything you read or hear.

[1] The false negative rate (the proportion of positive cases that are negatively classified) is known as the Type 2 Error, and the false positive rate (the proportion of negative cases that are positively classified) is known as the Type 1 Error; in our example, these are assumed to be equal, at 5%. In reality, they are not. At the time of writing, there are actually 410,000 confirmed COVID-19 cases in the UK. The testing sensitivity (that is, the proportion of people with COVID-19 who are tested, and are correctly classified as having the virus by the test) is much less than 95%, and is estimated to be closer to 70%. The test’s specificity (that is, the proportion of people without the virus who are tested, and are correctly classified as not having it) is about 95%. Thus the real Type 1 Error is estimated to be about 5%, and the Type 2 Error is estimated at 30%.
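As a rough sketch of how much these more realistic error rates matter, we can rerun the Bayes calculation with 70% sensitivity and 95% specificity, keeping the contrived 1/150 base rate from the main example purely for comparison:

# The same Bayes calculation, with the more realistic error rates quoted above.
base_rate = 1 / 150      # still the contrived base rate from the main example
sensitivity = 0.70       # a Type 2 Error (false negative rate) of 30%
specificity = 0.95       # a Type 1 Error (false positive rate) of 5%

p_positive = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
posterior = sensitivity * base_rate / p_positive
print(f"{posterior:.1%}")  # prints 8.6%: a positive test is now even weaker evidence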

 

Author


Harry Long,

Data Scientist

 
