Attempting to write out the holes in my model.
You point out that looking for a perfect reward function is too hard; optimization searches for upward errors in the rewards to exploit. But you then propose an RL scheme. It seems to me like it’s still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.
It seems like you have a few tools to combat this form of critique:
Model capacity. If the policy that exploits the upward errors is too complex to fit in the model, it cannot be learned. Or, more refined: if the proposed exploit-policy has features that make it difficult to learn (whether complexity, or some other special features).
I don’t think you ever invoke this in your story, and I guess maybe you don’t even want to, because it seems hard to separate the desired learned behavior from undesired learned behavior via this kind of argument.
Path-dependency. What is learned early in training might have an influence on what is learned later in training.
It seems to me like this is your main tool, which you want to invoke repeatedly during your story.
I am very uncertain about how path-dependency works.
It seems to me like this will have a very large effect on smaller neural networks, but a vanishing effect on larger and larger ones: as networks get large, the gradient updates get smaller and stay in a region of parameter space where the loss is approximately linear. This means that much larger networks have much less path-dependency.
Major caveat: the amount of data is usually scaled up with the size of the network. Perhaps the overall amount of path-dependency is preserved.
This contradicts some parts of your story, where you mention that not too much data is needed due to pre-training. However, you did warn that you were not trying to project confidence about story details. Perhaps lots of data is required in these phases to overcome the linearity of large-model updates, i.e., to get the path-dependent effects you are looking for. Or perhaps tweaking the step size is sufficient.
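As a crude probe of what I mean by path-dependency here (a toy linear model with made-up data; an illustration only, not evidence about large networks): train on the same examples in two different presentation orders and compare the final weights. With a tiny step size, one pass of SGD approaches the order-independent sum of per-example gradients, so the order gap nearly vanishes; with a larger step size, order matters.

```python
import math
import random

def train(order, lr):
    """One pass of per-example SGD on a fixed linear-regression dataset."""
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
    y = [rng.gauss(0, 1) for _ in range(8)]
    w = [0.0] * 4
    for i in order:
        err = sum(wj * xj for wj, xj in zip(w, X[i])) - y[i]
        w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]  # squared-error gradient step
    return w

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

order_a = list(range(8))
order_b = list(reversed(order_a))

# With a larger step size, the final weights depend on presentation order
# (path-dependence); with a tiny step size, the two runs nearly coincide.
gap_large = dist(train(order_a, 0.1), train(order_b, 0.1))
gap_small = dist(train(order_a, 1e-4), train(order_b, 1e-4))
print(f"order gap at lr=0.1:  {gap_large:.6f}")
print(f"order gap at lr=1e-4: {gap_small:.12f}")
```

The data-scaling caveat shows up here too: shrinking the step size while keeping one pass kills the order effect, which is why I suspect step size and data volume jointly control how much path-dependency survives.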
The exact nature of path-dependence seems very unclear.
In the shard language, it seems like one of the major path-dependency assumptions for this story is that gradient descent (GD) tends to elaborate successful shards rather than promote all relevant shards.
It seems unclear why this would happen in any one step of GD. At each neuron, all inputs which would have contributed to the desired result get strengthened.
My model of why very large NNs generalize well is that they effectively approximate Bayesian learning, hedging their bets by promoting all relevant hypotheses rather than just one.
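A minimal numeric sketch of the single-step point (one linear neuron, made-up numbers): every input that pushed toward the target gets its weight strengthened in proportion to its contribution, and nothing in the local update singles out one "winning" input.

```python
# One gradient step on a single linear neuron: all inputs that would have
# contributed to the desired result get strengthened at once, each in
# proportion to its contribution -- no winner is picked.
def neuron_step(w, x, target, lr=0.1):
    y = sum(wj * xj for wj, xj in zip(w, x))   # neuron output
    grad = [(y - target) * xj for xj in x]     # d(squared error)/dw_j
    return [wj - lr * g for wj, g in zip(w, grad)]

w = [0.2, 0.2, 0.2]
x = [1.0, 0.5, -0.3]   # first two inputs push the output up, the third down
w_new = neuron_step(w, x, target=1.0)

# Output was below target, so both weights on positive inputs grow (the
# larger input's weight grows more) and the negative input's weight shrinks.
print(w_new)
```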
An alternate interpretation of your story is that ineffective subnetworks are basically scavenged for parts by other subnetworks early in training, so that later on, the ineffective subnetworks don’t even have a chance. This effect could compound on itself as the remaining not-winning subnetworks have less room to grow (less surrounding stuff to scavenge to improve their own effective representation capacity).
Shards that work well end up increasing their surface area, and surface area determines a kind of learning rate for a shard (via the shard’s ability to bring other subnetworks under its gradient influence), so there is a compounding effect.
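Under this reading, the compounding looks something like replicator dynamics. A toy sketch with made-up numbers (a hypothetical model, not anything from your story): shard "sizes" share a fixed capacity, and each shard's growth rate scales with its current size times its effectiveness, so one shard's growth scavenges the others.

```python
# Toy compounding model (illustration only): growth rate = size * relative
# fitness, with total capacity fixed, so gains come at other shards' expense.
def step(sizes, fitness, lr=0.1):
    avg = sum(s * f for s, f in zip(sizes, fitness))  # capacity-weighted mean fitness
    return [s + lr * s * (f - avg) for s, f in zip(sizes, fitness)]

sizes = [1 / 3, 1 / 3, 1 / 3]      # three shards, equal initial capacity
fitness = [1.00, 0.95, 0.90]       # nearly equal effectiveness

for _ in range(500):
    sizes = step(sizes, fitness)

# A small edge in effectiveness compounds into most of the capacity.
print([round(s, 3) for s in sizes])
```

The point of the sketch is just that a modest, sustained effectiveness edge plus size-proportional growth is enough to produce the winner-take-most outcome the story needs; whether GD on real networks behaves this way is exactly what's unclear to me.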
Another major path-dependency assumption seems to be that as these successful shards develop, they tend to have a kind of value-preservation. (Relevant comment.)
EG, you might start out with very simple shards that activate when diamonds are visible and vote to walk directly toward them. These might elaborate to do path-planning toward visible diamonds, and then to do path-planning to any diamonds which are present in the world-model, and so on, upwards in sophistication. So you go from some learned behavior that could be anthropomorphized as diamond-seeking, to eventually having a highly rational/intelligent/capable shard which really does want diamonds.
Again, I’m unclear on whether this can be expected to happen in very large networks, due to the lottery ticket hypothesis. But assuming some path-dependency, it is unclear to me whether it will work like this.
The claim would of course be very significant if true. It’s a different framework, but effectively, this is a claim about ontological shifts—as you more-or-less flag in your major open questions.
While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment.
So, it seems to me, the most important research questions to figure out if a plan like this could be feasible revolve around the nature and existence of path-dependencies in GD, especially for very large NNs.
(Huh, I never saw this—maybe my weekly batched updates are glitched? I only saw this because I was on your profile for some other reason.)
I really appreciate these thoughts!
> But you then propose an RL scheme. It seems to me like it’s still a useful form of critique to say: here are the upward errors in the proposed rewards, here is the policy that would exploit them.

I would say “that isn’t how on-policy RL works; it doesn’t just intelligently find increasingly high-reinforcement policies; which reinforcement events get ‘exploited’ depends on the exploration policy.” (You seem to guess that this is my response in the next sub-bullets.)
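A toy sketch of the on-policy point (hypothetical three-armed bandit, made-up numbers): arm 2's reward of 10 plays the role of an upward error in the rewards, but whether it gets "exploited" depends entirely on whether the exploration policy ever samples it.

```python
import random

# Which reward errors get "exploited" depends on what the behavior policy
# actually samples, not on which high-reward events exist in principle.
def run_bandit(epsilon, steps=2000, seed=0):
    rng = random.Random(seed)
    true_reward = [1.0, 0.0, 10.0]      # arm 2 is the mis-specified jackpot
    q = [0.5, 0.0, 0.0]                 # initial estimates favor arm 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(3)      # explore
        else:
            arm = q.index(max(q))       # exploit current estimates
        q[arm] += 0.1 * (true_reward[arm] - q[arm])
    return q.index(max(q))              # which arm the learned policy prefers

print(run_bandit(epsilon=0.0))   # never explores: arm 2 is never reinforced, stays on arm 0
print(run_bandit(epsilon=0.1))   # with exploration, the reward error is found and exploited
```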
> While I find the particular examples intuitive, the overall claim seems too good to be true: effectively, that the path-dependencies which differentiate GD learning from ideal Bayesian learning are exactly the tool we need for alignment.

Shrug; “too good to be true” isn’t a causal reason for it not to work, of course, and I don’t see anything suspicious in the correlations. Effective learning algorithms may indeed have nice properties we want, especially if some humans have those same nice properties due to their own effective learning algorithms!