Here’s an example of recurrent neural nets learning intuitive / interpretable representations of some basic aspects of text, like keeping track of quotes and brackets: http://arxiv.org/abs/1506.02078
I know there are many papers showing that neural nets learn features that can, in some regimes, be given nice interpretations. However, in every case I am aware of where these representations have been thoroughly analyzed, they seem to fail obvious tests of naturality, which would include things like:
(1) Good performance on different data sets in the same domain.
(2) Good transfer to novel domains.
(3) Robustness to visually imperceptible perturbations to the input image.
Moreover, ANNs almost fundamentally cannot learn natural representations because they fail what I would call the “canonicality” test:
(4) Replacing the learned features with a random invertible linear transformation of the learned features should degrade performance.
Note that the reason for (4) is that if you want to interpret an individual hidden unit in an ANN as being meaningful, then it can’t be the case that a random linear combination of lots of units is equally meaningful (since a random linear combination of e.g. cats and dogs and 100 other things is not going to have much meaning).
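To make (4) concrete, here is a toy numpy sketch (all shapes and data are invented for illustration, not taken from any paper): if you scramble a feature matrix by a random invertible linear map, anything linearly decodable from the original features is equally decodable from the scrambled ones, since the two span the same space. This is exactly why an argument from what an individual unit responds to needs extra support.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake "learned features": n examples, d hidden units.
n, d = 200, 16
H = rng.normal(size=(n, d))

# Targets generated from the features plus noise.
w_true = rng.normal(size=d)
y = H @ w_true + 0.1 * rng.normal(size=n)

# A random linear transformation of feature space
# (a Gaussian matrix is invertible with probability 1).
A = rng.normal(size=(d, d))
H_scrambled = H @ A

# Fit a least-squares linear probe on both versions of the features.
w1, *_ = np.linalg.lstsq(H, y, rcond=None)
w2, *_ = np.linalg.lstsq(H_scrambled, y, rcond=None)

# The two probes make numerically identical predictions: the scrambled
# features span the same column space, so no linear readout can tell
# the "meaningful" basis apart from the scrambled one.
err = np.max(np.abs(H @ w1 - H_scrambled @ w2))
print(err)
```

So any downstream linear layer is indifferent to the basis; a preference for individual units has to come from somewhere else.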
That was a bit long-winded, but my question is whether the linked paper or any other papers provide representations that you think don’t fail any of (1)-(4).
I don’t see that (4) should be necessary; I may be misunderstanding it.
If you apply a change of basis to the inputs to a non-linearity, then I’m sure it will destroy performance. If you apply a change of basis to the outputs, then those outputs will cease to look meaningful, but it won’t stop the algorithm from working well. But just because the behavior of the algorithm is robust to applying a particular linear scrambling doesn’t mean that the representation is not natural, or that all of the scrambled representations must be just as natural as the one we started with.
Yeah, I should be a bit more careful about (4). The point is that many papers arguing that a given NN is learning “natural” representations do so by looking at what an individual hidden unit responds to (as opposed to looking at the space spanned by the hidden layer as a whole). Any such argument seems dubious to me without further support, since it relies on a sort of delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself. But I agree that if such an argument were accompanied by a justification of why the training procedure, data noise, or some other factor led to the symmetry being broken in a natural way, then I would potentially be happy.
“delicate symmetry-breaking which can only come from either the training procedure or noise in the data, rather than the model itself”
I’m still not convinced. The pointwise nonlinearities introduce a preferred basis, and cause the individual hidden units to be much more meaningful than linear combinations thereof.
Yeah; I discussed this with some others and came to the same conclusion. I do still think that one should explain why the preferred basis ends up being as meaningful as it does, but agree that this is a much more minor objection.
Thanks for the handy list of criteria. I’m not sure how (3) would apply to a recurrent neural net for language modeling, since it’s difficult to make an imperceptible perturbation of text (as opposed to an image).
Regarding (2): given the impressive performance of RNNs in different text domains (English, Wikipedia markup, LaTeX code, etc.), it would be interesting to see how an RNN trained on English text would perform on LaTeX code, for example. I would expect it to carry over some representations that are common to the training and test data, like the aforementioned brackets and quotes.