One question that gets asked regularly is ‘what is a good set of matching criteria?’, and the answer is of course ‘it depends on what good-quality information you can get’. One suggestion that is often put forward is that the combination of name and date of birth is a good indicator of uniqueness. Now I’ve got a very common name, and I’ve not only met people with the same name and the same birthday as me, I’ve met people with the same name born within a couple of days of me. I’m also fairly sure that there is someone out there with my name who was born on the same day as me, as I’ve had some odd ‘matches’ when talking to people in call centers.
But just how likely is a match? Let’s build on the Birthday Paradox, which shows that if you have 23 people in a room it’s better than 50/50 that two will share a birthday. For most people at school, where the set is restricted to people around your own age, this normally meant two people with the same birth date (day, month and year). On one occasion at school I was in a class with two people whose birthday was August 31st, so the question of who was ‘youngest’ in the year was a matter of a few hours.
Taking the maths behind the Birthday Paradox and adding in a few more variables, for instance assuming that 100 years will cover everyone and that ages are uniformly spread (this is clearly a false assumption, but one which makes it less rather than more likely to find a match), we take the old maths of the Birthday Paradox

$$p(n) = 1 - \frac{365!}{365^{n}\,(365 - n)!}$$

and expand for 100 years (36,525 days) to give

$$p(n) = 1 - \frac{36525!}{36525^{n}\,(36525 - n)!}$$
So now the probability p(n) that anyone in a group of n people has exactly the same birth date as another can be calculated. Let’s take the 50/50 point as where we decide that this really isn’t the sort of metric we should rely on, and try to understand just how big the 50/50 group is. Remember that this covers 36,525 days. A rudimentary maths approach would say that since 23 out of 365 is 6.3%, around 2,300 people (6.3% of 36,525) would be required. Probability however isn’t rudimentary maths, and the answer is that just 226 people are required before a match is more likely than not.
| Match probability | Day and month match | Full date match |
|---|---|---|
| 10% | 10 people | 89 people |
| 50% | 23 people | 226 people |
| 80% | 35 people | 343 people |
| 95% | 47 people | 468 people |
| 99% | 57 people | 579 people |
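For anyone who wants to check the figures, here is a minimal Python sketch of the calculation. It assumes, as above, that birth dates are uniformly spread over 365 days (day and month) or 36,525 days (full date over 100 years). The function names `collision_probability` and `smallest_group` are my own labels for illustration, not part of any MDM product.

```python
import math


def collision_probability(n: int, days: int) -> float:
    """Chance that at least two of n people share a date,
    assuming dates are uniformly spread over `days` values."""
    if n > days:
        return 1.0  # pigeonhole: more people than possible dates
    # Work in logs so the long product of (1 - k/days) stays numerically stable.
    log_all_distinct = sum(math.log1p(-k / days) for k in range(n))
    return 1.0 - math.exp(log_all_distinct)


def smallest_group(target: float, days: int) -> int:
    """Smallest group size whose collision probability reaches `target`."""
    n = 1
    while collision_probability(n, days) < target:
        n += 1
    return n


if __name__ == "__main__":
    print("target  day+month  full date")
    for target in (0.10, 0.50, 0.80, 0.95, 0.99):
        by_day = smallest_group(target, 365)
        by_date = smallest_group(target, 36525)
        print(f"{target:>6.0%}  {by_day:>9}  {by_date:>9}")
```

Run as a script, this should print the same group sizes as the table. Changing `36525` to a narrower range (say 40 years of likely customers) only makes the numbers smaller.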
Clearly, therefore, name and date of birth is far from satisfactory as a match criterion unless we are dealing only with uncommon names. This also tells us something about match probabilities within our MDM solutions, and critically it has an impact on how we view matching for Big Data: the larger the data set, the more likely it is that false positives will occur. This means that when we create probabilities for name matching, they should be driven not by a fixed assessment of likelihood but by a combination of factors, including the number of times the name appears in the source records. If a name appears in only 89 records then you have ~90% certainty that any date of birth match is the result of it being the same individual; at 468 records your certainty is down to 5%.
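As a rough illustration of driving the confidence off name frequency, the sketch below counts how often each name appears and turns that count into the chance that the group contains no coincidental date of birth match. It is only a sketch: the uniform 36,525-day assumption is the same simplification as before, and `match_confidence`, `confidence_by_name` and the sample records are hypothetical names and data made up for the example.

```python
import math
from collections import Counter

DAYS = 36525  # assumed: 100 years of equally likely birth dates


def match_confidence(name_count: int, days: int = DAYS) -> float:
    """Chance that `name_count` people sharing a name all have distinct
    birth dates, i.e. that a name + date of birth match is not a coincidence."""
    if name_count > days:
        return 0.0  # more people than dates: a coincidence is certain
    return math.exp(sum(math.log1p(-k / days) for k in range(name_count)))


def confidence_by_name(records):
    """Map each name in (name, date_of_birth) records to its match confidence."""
    counts = Counter(name for name, _dob in records)
    return {name: match_confidence(count) for name, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical source records: one very common name, one rare one.
    records = [("Steve Jones", None)] * 468 + [("Zebedee Quill", None)] * 3
    for name, confidence in confidence_by_name(records).items():
        print(f"{name}: {confidence:.1%}")
```

The point of the design is that the threshold is not a constant: a rare name keeps a confidence close to 100%, while a name that appears hundreds of times needs another attribute before the match can be trusted.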
Traditionally the quality of the match is improved by tying it to another attribute (physical address, email address, etc.), which significantly improves the likelihood of a decent match and reduces the odds of a false positive. This sort of approach has to continue, and it must be backed by real-world verification.
The reason for this post is that a client asked me how they could match their customers against Facebook, Twitter and the like to give a high-probability match of their customer base against this publicly shared information. My position was that they needed customers to self-identify, as it was the only way to guarantee the match: many people deliberately change information on social channels, and there is a risk of brand damage if you get it wrong or if people perceive you as ‘snooping’ on their lives.
A vendor however has said that they could do this matching for them with ‘very high certainty’ based on the social information. I’d assumed this would include some sort of geo-matching or require a minimum set of information, for instance place of birth, current address, etc., to be available. ‘No’, claimed the vendor, all they needed was name and date of birth and they could do the match. Now given that Facebook has >600m customers, and there certainly seem to be more than 200 people called Steve Jones among them, I’m confident that the results are going to be poor.
News now in: Yup the results are poor: Maths 1 – 0 Blind ignorance