jimrandomh comments on Open Thread, September, 2010-- part 2

jimrandomh 26 Sep 2010 21:09 UTC
0 points
I write data mining software professionally, and one vulnerability that comes to mind is the deduplication process. To combine data from different sources, the software has to recognize when two records correspond to the same person. To decide this, it looks for shared elements that have a low false positive rate: phone numbers, email addresses, social security numbers, and having the same account name on the same site are highly reliable; name-address pairs work but are less reliable; and having the same account name on different sites works but is less reliable still. The matching is applied transitively, so if A has the same phone number as B and B has the same phone number as C, then A, B, and C are all mapped to the same person.
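To make the transitive matching concrete, here is a minimal sketch using a union-find structure. It is only an illustration under stated assumptions, not how any particular product does it: the function name `cluster_records`, the field names `phone` and `email`, and the sample values are all made up, and real systems would also normalize values and weight the less reliable signals.

```python
from collections import defaultdict

def cluster_records(records, keys=("phone", "email")):
    """Group record indices that share any value for one of the given keys.

    Hypothetical helper: the field names are assumptions for illustration.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for idx, rec in enumerate(records):
        for key in keys:
            value = rec.get(key)
            if value is None:
                continue
            token = (key, value)
            if token in first_seen:
                union(idx, first_seen[token])
            else:
                first_seen[token] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())

records = [
    {"phone": "555-0100"},                            # A
    {"phone": "555-0100", "email": "b@example.com"},  # B
    {"email": "b@example.com"},                       # C
    {"phone": "555-0199"},                            # unrelated
]
clusters = cluster_records(records)
```

In this sample, A and C share no field directly, but both match B, so all three land in one cluster while the unrelated record stays on its own.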

One way to confuse this process is to create entries that evaluate as equivalent to two or more different people, i.e., entries that combine one person’s email address with a different person’s phone number. This causes the software to think they’re the same person. Creating a lot of entries like this in one data source will make that source useless for data mining, unless the data miners find a way to filter them out. Creating a small number of entries like this will confuse data miners when they deal with the specific people for whom such entries exist. Note that this is not necessarily a good idea, since having a computerized bureaucracy think you’re someone else can lead to very inconvenient consequences.
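The effect of a single mixed entry can be sketched in a few lines. This is a toy model under assumptions, not a real deduplication pipeline: `merge_clusters` is a hypothetical helper that links records whenever they share any identical field value, and the names and values are invented.

```python
def merge_clusters(records):
    """Return sets of record indices linked by any shared (field, value) pair.

    Hypothetical toy matcher: every field is treated as a fully reliable identifier.
    """
    clusters = []  # list of (identifier_set, member_index_set), kept disjoint
    for idx, rec in enumerate(records):
        ids = set(rec.items())
        members = {idx}
        remaining = []
        for other_ids, other_members in clusters:
            if ids & other_ids:
                # Shared identifier: absorb the existing cluster into this one.
                ids |= other_ids
                members |= other_members
            else:
                remaining.append((other_ids, other_members))
        remaining.append((ids, members))
        clusters = remaining
    return [members for _, members in clusters]

alice = {"email": "alice@example.com", "phone": "555-0101"}
bob = {"email": "bob@example.com", "phone": "555-0102"}
# The confusing entry: Alice's email paired with Bob's phone number.
mixed = {"email": "alice@example.com", "phone": "555-0102"}

before = merge_clusters([alice, bob])
after = merge_clusters([alice, bob, mixed])
```

Without the mixed entry the two people stay in separate clusters; with it, the matcher collapses all three records into one. A filter would have to notice that the bridging record is the only link between two otherwise well-separated clusters.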