For example, it turns out that a learning algorithm tasked with a relatively simple job, such as determining whether or not English sentences are valid, will automatically build up an internal representation of the world which captures many of the world's regularities – as a pure side effect of carrying out its task.
Thanks for writing this; a couple quick thoughts:
I think I’ve yet to see a paper that convincingly supports the claim that neural nets are learning natural representations of the world. For some papers that refute this claim, see e.g.
http://arxiv.org/abs/1312.6199 http://arxiv.org/abs/1412.6572
I think the Degrees of Freedom thesis is a good statement of one of the potential problems. Since it’s essentially making a claim about whether a certain very complex statistical problem is identifiable, I think it’s very hard to know whether it’s true or not without either some serious technical analysis or some serious empirical research—which is a reason to do that research, because if the thesis is true then that has some worrisome implications about AI safety.
http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.html is also relevant—tl;dr Google Photos classifies a leopard-print sofa as a leopard. I think this lends credence to the ‘treacherous turn’ insofar as it’s an example of a classifier seeming to perform well and breaking down in edge cases.
The classifier isn’t breaking down—it was trained to do well across the entire training set using a small amount of computation for each inference and a reasonable (larger) amount of computation for training.
Humans’ fastest recognition capability still takes 100 ms or so, and operating in that mode (rapid visual presentation), human inference accuracy is considerably less capable than modern ANNs—which classify using less time and also around 1000x fewer neurons/synapses.
I would bet that humans often make similar mistakes in fast recognition. And even if humans don’t make this specific mistake, it doesn’t matter because they make more total mistakes in other categories.
The fact that humans can do better given considerably more time and enormously more neural resources is hardly surprising (it involves more complex, multi-step inference).
Also, the ImageNet training criterion is not really a good match for human visual intuitions. It assigns the same penalty for mistaking a dog for a cat as it does for mistaking two closely related species of dogs. Humans have a more sensible hierarchical error allocation. This may be relatively low-hanging fruit for ANNs (not sure), but someone is probably working on it if it hasn’t already been done.
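To make the hierarchical-error idea concrete, here is a toy sketch (the taxonomy and category names are invented purely for illustration): a mistake between siblings under "dog" costs less than a mistake whose paths split all the way up at "animal".

```python
# Toy sketch of a hierarchy-aware error: confusing two closely related
# dog breeds should cost less than confusing a dog with a cat.
# The taxonomy below is invented purely for illustration.
TAXONOMY = {
    "beagle":  ["animal", "dog", "beagle"],
    "terrier": ["animal", "dog", "terrier"],
    "tabby":   ["animal", "cat", "tabby"],
}

def tree_distance(a, b):
    """Steps from a up to the deepest common ancestor, then down to b."""
    pa, pb = TAXONOMY[a], TAXONOMY[b]
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

print(tree_distance("beagle", "terrier"))  # 2 (siblings under "dog")
print(tree_distance("beagle", "tabby"))    # 4 (paths split at "animal")
```

A flat 0/1 loss like ImageNet's scores both mistakes identically; using tree distance as the penalty would encode the "more sensible hierarchical error allocation" mentioned above.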
Humans’ fastest recognition capability still takes 100 ms or so, and operating in that mode (rapid visual presentation), human inference accuracy is considerably less capable than modern ANNs.
This doesn’t seem right, assuming that “considerably less capable” means “considerably worse accuracy at classifying objects not drawn from ImageNet”. Do you have a study in mind that shows this? In either case, I don’t think this is strong enough to support the claim that the classifier isn’t breaking down—it’s pretty clearly making mistakes where humans would find the answer obvious. I don’t think that saying that the ANN answers more quickly is a very strong defense.
Comparing different recognition systems is complex, and it’s important to compare apples to apples. CNNs are comparable only to rapid feedforward recognition in the visual system, which can be measured with rapid serial presentation experiments. In an untimed test the human brain can use other modules, memory fetches, multi-step logical inferences, etc. (all of which are now making their way into ANN systems, but still).
The RSP setup ensures that the brain can only use a single feedforward pass from V1 to PFC, without more complex feedback and recurrent loops. It forces the brain to use a network configuration similar to what current CNNs use—CNNs descend from models of that pathway, after all.
In those tests, CNNs from 2013 rivaled primate IT cortex representations [1], and 2015 CNNs are even better.
That paper uses a special categorization task with monkeys, but the results generalize to humans as well. There are certainly some mistakes that a CNN will make which a human would not make even with the 150 ms time constraint, but CNNs make fewer mistakes on the more complex tasks with lots of categories, whereas humans presumably still have lower error on basic recognition tasks (though to some extent that is because researchers haven’t focused much on getting to >99.9% accuracy on simpler recognition tasks).
Cool, thanks for the paper, interesting read!
Except that from a human point of view, a leopard-print sofa isn’t an edge case at all.
Taboo “natural representations”?
Without defining a natural representation (since I don’t know how to), here are four properties that I think a representation should satisfy before it’s called natural (I also give these in my response to Vika):
(1) Good performance on different data sets in the same domain.
(2) Good transference to novel domains.
(3) Robustness to visually imperceptible perturbations to the input image.
(4) “Canonicality”: replacing the learned features with a random invertible linear transformation of the learned features should degrade performance.
So to clarify, my claim was not that we’d yet have algorithms producing representations that would fulfill all of these criteria. But it would seem to me that something like word embeddings would be moving in the direction of fulfilling them. E.g. this bit from the linked post:
Recently, deep learning has begun exploring models that embed images and words in a single representation.
The basic idea is that one classifies images by outputting a vector in a word embedding. Images of dogs are mapped near the “dog” word vector. Images of horses are mapped near the “horse” vector. Images of automobiles near the “automobile” vector. And so on.
The interesting part is what happens when you test the model on new classes of images. For example, if the model wasn’t trained to classify cats – that is, to map them near the “cat” vector – what happens when we try to classify images of cats?
It turns out that the network is able to handle these new classes of images quite reasonably. Images of cats aren’t mapped to random points in the word embedding space. Instead, they tend to be mapped to the general vicinity of the “dog” vector, and, in fact, close to the “cat” vector. Similarly, the truck images end up relatively close to the “truck” vector, which is near the related “automobile” vector.
This was done by members of the Stanford group with only 8 known classes (and 2 unknown classes). The results are already quite impressive. But with so few known classes, there are very few points to interpolate the relationship between images and semantic space off of.
The Google group did a much larger version – instead of 8 categories, they used 1,000 – around the same time (Frome et al. (2013)) and has followed up with a new variation (Norouzi et al. (2014)). Both are based on a very powerful image classification model (from Krizhevsky et al. (2012)), but embed images into the word embedding space in different ways.
The results are impressive. While they may not get images of unknown classes to the precise vector representing that class, they are able to get to the right neighborhood. So, if you ask it to classify images of unknown classes and the classes are fairly different, it can distinguish between the different classes.
Even though I’ve never seen an Aesculapian snake or an Armadillo before, if you show me a picture of one and a picture of the other, I can tell you which is which because I have a general idea of what sort of animal is associated with each word. These networks can accomplish the same thing.
sounds to me like it would represent clear progress towards at least #1 and #2 of your criteria.
I agree that the papers on adversarial examples that you cited earlier are evidence that many current models are still not capable of meeting criterion #3, but on the other hand the second paper does seem to present clear signs that the reasons for the pathologies are being uncovered and addressed, and that future algorithms will be able to avoid this class of pathology. (Caveat: I do not yet fully understand those papers, so may be interpreting them incorrectly.)
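A minimal sketch of the scheme described in the quoted passage, with made-up vectors (the real systems learn both the image-to-embedding mapping and the word embedding; everything below is invented for illustration):

```python
import numpy as np

# Toy sketch of the quoted scheme: an image model outputs a vector in a
# shared word-embedding space, and classification is "nearest word vector".
# All vectors below are invented for illustration, not learned embeddings.
word_vecs = {
    "dog":        np.array([1.0, 0.1, 0.0]),
    "cat":        np.array([0.9, 0.3, 0.1]),
    "automobile": np.array([0.0, 1.0, 0.2]),
    "truck":      np.array([0.1, 0.9, 0.3]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(image_embedding, vocab):
    # Compare against every word vector, including words the image model
    # was never trained on: this is the zero-shot step.
    return max(vocab, key=lambda w: cos(image_embedding, vocab[w]))

# An image of a cat lands in the dog/cat neighbourhood and, being closest
# to the "cat" vector, is labelled correctly despite no cat training images.
img = np.array([0.85, 0.35, 0.12])
print(classify(img, word_vecs))  # cat
```

The point of the sketch is that nothing in `classify` distinguishes trained from untrained classes; generalization to unknown classes comes entirely from the geometry of the shared embedding space.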
My impression is that they can in fact learn “natural” representations of the world; a good example here: http://arxiv.org/abs/1311.2901
On the other hand, since they tend to be task-specific learners, they might take shortcuts that we wouldn’t perceive as “natural”; our “natural object” ontology is optimized for a much more general task than most NNets.
If I’m correct about this I would expect NNets to become more “natural” as the tasks get closer to being “AI-complete”, such as question-answering systems and scene description networks.
Here’s an example of recurrent neural nets learning intuitive / interpretable representations of some basic aspects of text, like keeping track of quotes and brackets: http://arxiv.org/abs/1506.02078
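To make the linked result concrete: the paper reports individual LSTM cells whose activation tracks interpretable state, e.g. "currently inside a quoted string". A hand-written stand-in for such a cell (not the learned network itself, just the kind of feature being claimed):

```python
# A hand-coded analogue of the "quote-detection cell" the linked paper
# finds inside character-level LSTMs: its activation is 1 while the
# text is inside double quotes and 0 otherwise.
def quote_cell(text):
    inside = False
    trace = []
    for ch in text:
        if ch == '"':
            inside = not inside  # toggle on each quote mark
        trace.append(1 if inside else 0)
    return trace

print(quote_cell('say "hi" now'))  # [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```

The interpretability claim in the paper is precisely that some learned cells behave like this simple finite-state tracker, despite never being told about quotes.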
I know there are many papers showing that neural nets learn features that can, in some regimes, be given nice interpretations. However, in all cases I’m aware of where these representations have been thoroughly analyzed, they seem to fail obvious tests of naturality, which would include things like:
(1) Good performance on different data sets in the same domain.
(2) Good transference to novel domains.
(3) Robustness to visually imperceptible perturbations to the input image.
Moreover, ANNs almost fundamentally cannot learn natural representations because they fail what I would call the “canonicality” test:
(4) Replacing the learned features with a random invertible linear transformation of the learned features should degrade performance.
Note that the reason for (4) is that if you want to interpret an individual hidden unit in an ANN as being meaningful, then it can’t be the case that a random linear combination of lots of units is equally meaningful (since a random linear combination of e.g. cats and dogs and 100 other things is not going to have much meaning).
That was a bit long-winded, but my question is whether the linked paper or any other papers provide representations that you think don’t fail any of (1)-(4).
I don’t see that (4) should be necessary; I may be misunderstanding it.
If you apply a change of basis to the inputs to a non-linearity, then I’m sure it will destroy performance. If you apply a change of basis to the outputs, then those outputs will cease to look meaningful, but it won’t stop the algorithm from working well. But just because the behavior of the algorithm is robust to applying a particular linear scrambling doesn’t mean that the representation is not natural, or that all of the scrambled representations must be just as natural as the one we started with.
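This can be checked directly in a toy two-layer network: a random invertible map applied to a linear layer's output is absorbed exactly by folding its inverse into the next weight matrix, but the same trick fails across a pointwise nonlinearity (a sketch with random matrices, assuming a ReLU net):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# A tiny two-layer net with random weights (purely illustrative).
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(3, 5))
x = rng.normal(size=4)

# Random invertible change of basis applied to the hidden layer.
M = rng.normal(size=(5, 5))
Minv = np.linalg.inv(M)

# Linear case: folding M into one layer and M^-1 into the next leaves
# the overall function exactly unchanged.
linear_orig = W2 @ (W1 @ x)
linear_scrambled = (W2 @ Minv) @ (M @ (W1 @ x))
print(np.allclose(linear_orig, linear_scrambled))  # True

# Nonlinear case: relu(M @ h) != M @ relu(h), so the scrambled net
# computes a genuinely different function; the pointwise nonlinearity
# picks out a preferred basis for the hidden units.
nonlinear_orig = W2 @ relu(W1 @ x)
nonlinear_scrambled = (W2 @ Minv) @ relu(M @ (W1 @ x))
print(np.allclose(nonlinear_orig, nonlinear_scrambled))  # generically False
```

So scrambling the hidden basis does degrade the function of a ReLU net, which is the sense in which the nonlinearity breaks the symmetry between a hidden unit and random linear combinations of hidden units.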
Yeah I should be a bit more careful on number 4. The point is that many papers which argue that a given NN is learning “natural” representations do so by looking at what an individual hidden unit responds to (as opposed to looking at the space spanned by the hidden layer as a whole). Any such argument seems dubious to me without further support, since it relies on a sort of delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself. But I agree that if such an argument was accompanied by justification of why the training procedure or data noise or some other factor led to the symmetry being broken in a natural way, then I would potentially be happy.
delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself
I’m still not convinced. The pointwise nonlinearities introduce a preferred basis, and cause the individual hidden units to be much more meaningful than linear combinations thereof.
Yeah; I discussed this with some others and came to the same conclusion. I do still think that one should explain why the preferred basis ends up being as meaningful as it does, but agree that this is a much more minor objection.
Thanks for the handy list of criteria. I’m not sure how (3) would apply to a recurrent neural net for language modeling, since it’s difficult to make an imperceptible perturbation of text (as opposed to an image).
Regarding (2): given the impressive performance of RNNs on different text domains (English, Wikipedia markup, LaTeX code, etc.), it would be interesting to see how an RNN trained on English text would perform on LaTeX code, for example. I would expect it to carry over some representations that are common to the training and test data, like the aforementioned brackets and quotes.