Anticorrelated Noise Injection for Improved Generalization

Just a study I saw on /r/MachineLearning: link.

Basically, one way of training neural networks is to inject random noise into the parameters during training. Usually that noise is drawn independently at each training step, but in the paper they make it negatively correlated between consecutive steps, and argue that this improves generalization because it biases the optimizer toward flatter minima.
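For anyone who wants the mechanics, here is a minimal sketch of the idea as I understand it: instead of adding a fresh i.i.d. Gaussian perturbation at every step, you add the difference of two consecutive i.i.d. draws, which makes successive perturbations negatively correlated. The function names, hyperparameters, and the exact placement of the noise are my own illustration, not taken from the paper.

```python
import numpy as np

def noisy_gd(grad, w0, lr=0.1, sigma=0.01, steps=1000, anticorrelated=True, seed=None):
    """Gradient descent with parameter noise injection.

    anticorrelated=False: add i.i.d. Gaussian noise each step (standard perturbed GD).
    anticorrelated=True:  add xi_new - xi_old, the increment of i.i.d. draws, so
    consecutive perturbations are negatively correlated (my reading of the paper's
    "anticorrelated" variant; where exactly the noise enters is an assumption here).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    xi_prev = np.zeros_like(w)
    for _ in range(steps):
        xi = rng.normal(0.0, sigma, size=w.shape)
        noise = (xi - xi_prev) if anticorrelated else xi
        xi_prev = xi
        w = w - lr * grad(w) + noise
    return w

# Toy usage on a 1-D quadratic (hypothetical example, not from the paper):
if __name__ == "__main__":
    grad = lambda w: 2.0 * w  # gradient of L(w) = w**2
    print(noisy_gd(grad, w0=np.array([1.0])))
```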

This seems conceptually related to ideas that have been discussed on LessWrong, e.g. John Wentworth's observation that search tends to lead to flat minima, which may have beneficial properties.

I would have liked to see them test this on harder problems than the ones they used, and/or on a greater variety of real-world problems.