So, given that we’ve got a high concentration of technical people around here, maybe someone can answer this for me:
Could it ever be possible to do some kind of counter-data mining?
Everybody has some publicly-available info on the internet—information that, in general, we actually want to be publicly available. I have an online presence, sometimes under my real name and sometimes under aliases, and I wouldn’t want to change that.
But data mining is, of course, a potential privacy nightmare. There are algorithms that can tell if you’re gay from your Facebook page, and reassemble your address and social security number from aggregating apparently innocuous web content. There’s even a tool (www.recordedfuture.com) that purportedly helps clients like the CIA predict subjects’ future movements. But so far, I’ve never heard of attempts to make data mining harder for the snoops. I’m not talking about advice like “Don’t put anything online you wouldn’t want in the newspaper.” I’m interested in technical solutions—the equivalent of cryptography.
It’s a pipe dream, but it might not be impossible. Here’s Wikipedia background, with good additional references, on nonlinear dimensionality reduction techniques, which are one of my academic interests. (http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction) These techniques involve taking a cloud of points in a high-dimensional space and deciphering the low-dimensional manifold on which they lie. In other words, they extract the salient information from the data. And there are standard manifolds on which various techniques are known to fail—it’s hard for algorithms to recognize the “Swiss roll,” for instance.
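To make that concrete, here’s a minimal sketch of the Swiss roll case (assuming scikit-learn; the parameter values are just illustrative): a linear method flattens the roll and loses the structure, while a manifold method can unroll it, but only as long as its neighborhood graph holds up.

```python
# A toy demonstration of nonlinear dimensionality reduction on the
# "Swiss roll," the classic hard case mentioned above.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# 1500 points lying on a 2-D sheet rolled up in 3-D space.
X, t = make_swiss_roll(n_samples=1500, noise=0.1)

# Linear PCA just projects the roll flat, collapsing points that are
# far apart along the surface.
X_pca = PCA(n_components=2).fit_transform(X)

# Isomap approximates along-the-surface (geodesic) distances through a
# nearest-neighbor graph, so it can recover the 2-D sheet -- until noise
# or sparse sampling short-circuits the graph across the roll's layers.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
```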
These hard cases are disappointments for the data miner, but they ought to be opportunities for the counter-data miner, right? Could it be possible to exploit the hard cases to make it more difficult for the snoops? One practical example of something like this already exists: the distorted letters in a CAPTCHA are “hard cases” for automated image recognition software.
Does anybody have thoughts on this?
I write data mining software professionally, and one weakness that comes to mind is the deduplication process. In order to combine data from different sources, the software has to determine which entries correspond to the same person. It does this by looking for common elements with a low false positive rate: phone numbers, email addresses, social security numbers, and having the same account name on the same site are highly reliable; name-address pairs work but are less reliable; having the same account name on different sites works but is less reliable still. This relation is transitive, so if A has the same phone number as B and B has the same email address as C, then A, B, and C will all be assumed to be the same person.
You can subvert this by creating records which map as equivalent to two different people, such as a record with one person’s phone number and another person’s email address. If a data source contains too many entries like this, it’s useless unless there’s an easy way to filter them out; if it contains just a few, data miners are likely to get confused about the specific people involved. Note that this is not necessarily a good idea, since having a computerized bureaucracy be confused about your identity can have very inconvenient consequences. It is also possible to detect and defeat this strategy by looking for deduplications with strange results, but this is tricky in practice, since people often really do have multiple names (maiden names, alternate spellings), phone numbers, email addresses, etc.
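Here’s a minimal sketch of the transitive merging described above (the field names and record format are invented for illustration), using a union-find structure: any two records sharing a high-confidence identifier land in the same person-cluster, and a chaff record carrying one person’s phone number and another’s email address silently welds their clusters together.

```python
from collections import defaultdict

def dedup(records):
    """records: list of dicts with optional 'phone', 'email', etc. keys."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Index records by each identifying key; a collision means "same person".
    seen = defaultdict(dict)  # key name -> value -> first record index seen
    for i, rec in enumerate(records):
        for key in ("phone", "email", "ssn", "site_account"):
            val = rec.get(key)
            if val is None:
                continue
            if val in seen[key]:
                union(i, seen[key][val])  # transitive: A~B and B~C => A~C
            else:
                seen[key][val] = i

    # Collect the final person-clusters.
    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return list(clusters.values())

people = [
    {"phone": "555-0100", "email": "alice@example.com"},
    {"phone": "555-0200", "email": "bob@example.com"},
    {"phone": "555-0100", "email": "bob@example.com"},  # chaff record
]
print(dedup(people))  # [[0, 1, 2]] -- all three merge into one "person"
```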
But data mining is, of course, a potential privacy nightmare. There are algorithms that can tell if you’re gay from your Facebook page, and reassemble your address and social security number from aggregating apparently innocuous web content.
Really? Where can I find said algorithms? Knowing how they work would obviously be a useful way of thwarting them.
I’ve heard of one for determining your sexual orientation (if you don’t reveal it on your info page), but it’s based on the revealed sexual orientations of your friends (if a lot are gay, you probably are too), so it’s harder to thwart than, say, something based on your favorite songs.
Here’s the article where I heard about the gay Facebook page thing:
http://www.boston.com/bostonglobe/ideas/articles/2009/09/20/project_gaydar_an_mit_experiment_raises_new_questions_about_online_privacy/?page=full
Here’s where I read about calculating SSNs:
http://arstechnica.com/tech-policy/news/2009/07/social-insecurity-numbers-open-to-hacking.ars
Apparently, it looks at the self-reported gender and sexual orientation of your Facebook friends, and uses that information to guess your own sexual orientation. Here’s how I would do that:
1. Gather three variables: your gender, the male/female ratio of your friends, and the ratio of gay-or-bisexual to straight people among those of your friends who state their own sexual orientation. If I wanted to be extra-fancy, I might also include a sparse array of events and clubs that the person was signed up for.
2. Apply some standard machine learning tools to this, discretizing variables if necessary. Use people who report their sexual orientation as training and testing data. (A sketch of these two steps follows the list.)
3. Practice my evil villain laugh.
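Roughly what steps 1 and 2 might look like in code (a hedged sketch: scikit-learn is assumed, and the profile format is invented for illustration):

```python
# A minimal sketch of the friend-network "gaydar": extract the three
# variables from step 1, then fit a standard classifier per step 2.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def features(profile):
    """Step 1: three variables for one user (invented profile format)."""
    friends = profile["friends"]
    n = len(friends) or 1
    male_ratio = sum(f.get("gender") == "male" for f in friends) / n
    reported = [f for f in friends if f.get("orientation")]
    gay_ratio = (sum(f["orientation"] != "straight" for f in reported)
                 / (len(reported) or 1))
    return [1.0 if profile.get("gender") == "male" else 0.0,
            male_ratio, gay_ratio]

def train(profiles):
    """Step 2: train and test on users who do report their orientation."""
    labeled = [p for p in profiles if p.get("orientation")]
    X = np.array([features(p) for p in labeled])
    y = np.array([p["orientation"] != "straight" for p in labeled])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25)
    model = LogisticRegression().fit(X_tr, y_tr)
    print("held-out accuracy:", model.score(X_te, y_te))
    return model
```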
In order to defend against this, you could apply steps 1 and 2, then look at what the machine learning program tells you. Try to match its profile of a straight person. Then you can remain Facebook-closeted even in the face of the all-seeing electronic gaydar.
It’s theoretically obvious that you can try this with a nontrivial chance of success, but not at all obvious that, given enough skill and work, success is assured (which was the claim). Establishing the latter would require actual experiments, or at least knowledge of them.
Try to match its profile of a straight person. Then you can remain Facebook-closeted even in the face of the all-seeing electronic gaydar.
I have no problem with people knowing that I’m gay. Come to think of it, I have no problem with people knowing my social security number. (We don’t even have a commonly used equivalent here, although driver’s licence numbers and birth certificate IDs are sometimes useful.)
My general thought is that so little data is needed to identify you that the dataset can be enormously noisy and still identify you. And if your fake data is just randomly generated, isn’t that all it is, noise?
(I saw a paper, about medical datasets I think, showing that you can’t anonymize the data successfully and still have a useful dataset; I don’t have it handy, but it’s not hard to find people saying the same about the Netflix dataset: http://33bits.org/2010/03/15/open-letter-to-netflix/ )
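A back-of-the-envelope calculation (my own numbers, though in line with the reidentification literature, e.g. Sweeney’s ZIP/birthdate/gender result) shows why so little data suffices:

```python
# ZIP code + birth date + gender alone carry more combinations than
# there are Americans, so a typical combination picks out one person.
import math

zips, birthdates, genders = 42_000, 365 * 100, 2
combos = zips * birthdates * genders           # ~3.1e9 combinations
population = 310_000_000

print("bits in (zip, birthdate, gender):", round(math.log2(combos), 1))
# ~31.5 bits available...
print("bits needed to single out one person:",
      round(math.log2(population), 1))
# ...versus ~28.2 bits needed: the quasi-identifiers alone overshoot.
```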
I’ve heard about the medical datasets.
Noise is a pretty interesting thing, and the possibility of “denoising” depends a lot on the kind of noise. White noise is the easiest to get rid of; malicious noise, which isn’t random but targeted to be “worst-case,” can thwart denoising methods that were designed for white noise.
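A toy illustration of the difference (my own example, not from any paper): averaging defeats white noise, but a single worst-case sample defeats averaging, and you need a robust statistic to survive it.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0

# White noise: denoising by averaging 100 samples works beautifully.
white = true_value + rng.normal(0, 1, size=100)
print(np.mean(white))        # ~5.0

# Malicious noise: one adversarial sample aimed against the mean.
malicious = white.copy()
malicious[0] = 1e6
print(np.mean(malicious))    # ~10005 -- the white-noise method is ruined
print(np.median(malicious))  # ~5.0   -- a robust estimator barely notices
```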
I think there are three different problems here, each of which calls for different solutions.
Problem 1 is data floating around that is intrinsically harmful for strangers to have—your credit card number, for example. Sometimes you put that number online, and you would really rather it not be widely distributed. This problem can probably be solved by straightforward cryptography; if your CC# is never sent in the clear and changes every few weeks, and you don’t buy from an untrustworthy vendor more than once every few weeks, you’ll mostly be fine.
Problem 2 is data floating around that can be assembled to draw generalizations about your personal life—e.g., you’re gay. Perhaps I’m speaking from a position of excess privilege, but one good medicine for that sort of thing is sunshine—if you find a job and a support network that you don’t have to keep secrets from, you can’t be blackmailed and won’t need that sort of privacy as much. I’m skeptical that online data-mining will reveal much more of this kind of personal info about anyone than casual observation would in the near future; if you’re constantly listening to Justin Timberlake, someone will eventually figure out that you like Justin Timberlake even if you never go online.
Problem 3 is people predicting your next move from your previous history. That’s kind of spooky and could be dangerous if you have enemies, but the solution is straightforward: vary your routine! If you add a bit of spontaneity to your life, the men in the black suits will have to use a satellite to find you; maybe you’ll get lucky and their budget will get cut.
It’s Problem 2 that I’m worried about; or rather, I’m not specifically worried for myself, but I think it’s an interesting problem.
If information is really supposed to be private (a credit card number), then you’re right, straightforward cryptography is the answer. But a lot of the time, we make information public with the understanding that the viewer is a person, not a bot, and a person who has some reason to look (most people viewing my LW posts are people who read LW). We want it to be public, sure, but we don’t intend it to be quite as public as “all instantly assemblable and connectable to my real name.” In practice there are degrees of publicness.
As a personal issue, yeah, I’d like my job and support network to be the kind that wouldn’t be shocked by what they find about me.
Hm. OK, just brainstorming here; not sure if this idea is valuable.
Suppose you found a way to -detect- when someone was assembling your data? Like if all your public posts had little electronic watchdogs on them that reported in when they were viewed, and if a sufficiently high percentage of the watchdogs report in on the same minute, or if a sufficiently broad cross-section of the watchdogs report in on the same minute, then you know you’re being scanned, and the watchdogs try to trace the entity doing the scanning?
And then if all the people who didn’t like being bot-scanned cooperated and shared their information about who the scanners were so as to trace them more effectively and confirm the scanners’ real identities? You could maybe force them to stop via legal action, or, if the gov’t won’t cooperate, just fight back by exposing the private info of the owners/employees of the bots?
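For what it’s worth, here’s a minimal sketch of the detection half of that idea (the log format, threshold, and window are all invented): each public post carries a beacon, and a monitor flags any client that fetches many distinct posts within one window, which is the signature of a bulk scrape rather than a human reader.

```python
# Toy "watchdog" monitor: beacons on our posts report each view, and we
# raise an alarm when one client hits many distinct posts per minute.
from collections import defaultdict

WINDOW = 60      # seconds
THRESHOLD = 20   # distinct posts per window that we'll call a "scan"

hits = defaultdict(list)  # client id -> list of (timestamp, post_id)

def beacon_hit(client, post_id, timestamp):
    """Called whenever a beacon on one of our posts reports in."""
    hits[client].append((timestamp, post_id))
    recent = {pid for (t, pid) in hits[client] if timestamp - t <= WINDOW}
    if len(recent) >= THRESHOLD:
        print(f"possible scan by {client}: {len(recent)} posts in {WINDOW}s")
```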
If you found such a way, then a lot of interesting consequences would follow.
Of course, there is no such way, for the same reason that the history of DRM is a history of failure: once your data has been served to a reader, you have no control over what software they run on it, so a scraper can simply decline to trigger your watchdogs.
Obligatory fiction reference: Paranoid Linux. Unfortunately the real-life project that the post is about seems to be dead; no idea if there are any similar efforts still active.
I wasn’t aware of this, but it was pretty much exactly my idea, except that the chaff would be targeted to make standard algorithms draw a blank (basically, whenever the algorithm wants something to be sparse, we make it really not sparse).
Damn, Cory Doctorow, I thought I was clever.
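Tying this back to the deduplication comment above, a toy version of targeted chaff (made-up record format again) would mix identifiers from different real people, so the miner’s person-clusters stop being sparse and distinct:

```python
import random

def make_chaff(records, n_chaff):
    """Each chaff record borrows its phone number from one real record
    and its email from another, forging a false equivalence that a
    transitive deduper will happily propagate."""
    chaff = []
    for _ in range(n_chaff):
        a, b = random.sample(records, 2)
        chaff.append({"phone": a.get("phone"), "email": b.get("email")})
    return chaff
```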
The same idea is also in Vernor Vinge’s “Rainbows End” (the so-called “Friends of Privacy”), and a similar idea is in Stephenson’s “Anathem,” where the variant is termed “bogons.”
This paper might be relevant.