gwern comments on Open Thread, September, 2010-- part 2

gwern 26 Sep 2010 2:33 UTC
3 points
My general thought is that so little data is needed to identify you that the dataset can be enormously noisy and still identify you. And if your fake data is just randomly generated, isn’t noise all it is?

(I saw a paper, about medical datasets I think, showing that you couldn’t anonymize the data successfully and still have a useful dataset. I don’t have it handy, but it’s not hard to find people saying the same of the Netflix dataset, that it can’t be done: http://33bits.org/2010/03/15/open-letter-to-netflix/ )
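To make the intuition concrete, here is a rough toy sketch, not a model of any real dataset. Everything in it is an assumption chosen for illustration: each person is a random 256-bit record, the "anonymized" release flips every bit independently with probability `flip_prob`, and the attacker knows the target's true record and simply picks the released record closest in Hamming distance. The names `add_noise` and `reidentification_rate` are hypothetical.

```python
import random


def hamming(a, b):
    """Number of bit positions in which two integer-encoded records differ."""
    return bin(a ^ b).count("1")


def add_noise(record, n_bits, flip_prob, rng):
    """Flip each of the n_bits bits independently with probability flip_prob."""
    mask = 0
    for i in range(n_bits):
        if rng.random() < flip_prob:
            mask |= 1 << i
    return record ^ mask


def reidentification_rate(n_users=1000, n_bits=256, flip_prob=0.2,
                          n_targets=200, seed=0):
    """Fraction of targets whose noisy record is the nearest match to their true one."""
    rng = random.Random(seed)
    # Assumption: records are uniformly random bit strings (real data is not).
    true_records = [rng.getrandbits(n_bits) for _ in range(n_users)]
    released = [add_noise(r, n_bits, flip_prob, rng) for r in true_records]

    hits = 0
    for target in range(n_targets):
        known = true_records[target]
        # Attacker's guess: the released record closest to what they already know.
        guess = min(range(n_users), key=lambda j: hamming(known, released[j]))
        hits += guess == target
    return hits / n_targets


if __name__ == "__main__":
    print("20% of bits flipped:", reidentification_rate(flip_prob=0.2))
    print("50% of bits flipped:", reidentification_rate(flip_prob=0.5))
```

In this toy setup, flipping a fifth of the bits should still leave nearly every target matchable, while flipping half of them makes the release independent of the truth, at which point it is also useless as data. Real datasets are sparser and more correlated than this, so take it only as an illustration of the trade-off.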
- [deleted] 26 Sep 2010 2:41 UTC
  3 points
  I’ve heard about the medical datasets.
  
  Noise is a pretty interesting thing, and the possibility of “denoising” depends a lot on the kind of noise. White noise is the easiest to get rid of; malicious noise, which isn’t random but targeted to be “worst-case,” can thwart denoising methods that were designed for white noise.
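A minimal sketch of that difference, under heavy simplifying assumptions: the "signal" is a single constant, the denoiser is plain averaging, white noise is zero-mean Gaussian, and the "malicious" noise has a comparable size but always pushes in the same direction. The function names are hypothetical, and real denoising methods and real adversaries are far more elaborate than this.

```python
import random
import statistics


def mean_denoise(samples):
    """Denoiser that assumes zero-mean noise: average the samples."""
    return statistics.fmean(samples)


def white_noise_samples(true_value, n, sigma, rng):
    """Observations corrupted by independent zero-mean Gaussian noise."""
    return [true_value + rng.gauss(0.0, sigma) for _ in range(n)]


def targeted_noise_samples(true_value, n, magnitude):
    """Observations corrupted by noise that always pushes the same way."""
    return [true_value + magnitude for _ in range(n)]


def denoising_errors(true_value=10.0, n=10_000, size=1.0, seed=0):
    """Absolute error of the averaging denoiser under each kind of noise."""
    rng = random.Random(seed)
    white = white_noise_samples(true_value, n, size, rng)
    targeted = targeted_noise_samples(true_value, n, size)
    return (abs(mean_denoise(white) - true_value),
            abs(mean_denoise(targeted) - true_value))


if __name__ == "__main__":
    white_err, targeted_err = denoising_errors()
    print("error under white noise:   ", white_err)
    print("error under targeted noise:", targeted_err)
```

Averaging shrinks the white-noise error roughly as one over the square root of the number of samples, but no amount of averaging removes the targeted offset, because it violates the zero-mean assumption the denoiser was built on.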