I think this decomposes into two questions: 1) does the amplification process, given humans/trained agents, solve the problem in a generalizable way (i.e. would HCH solve the problem correctly)? 2) Does this generalizability break during the distillation process? (I'm not quite sure which of these you're pointing at here.)
For the amplification process, I think it would deal with things in an appropriately generalizable way. You are doing something a bit more like training the agents to form nodes in a decision tree that captures all of the important questions you would need to answer to figure out what to do next, including components that examine the situation in detail. Paul has written up an example of what amplification might look like, which I think helped me understand the level of abstraction that things are working at. The claim then is that expanding the decision tree captures all of the relevant considerations (possibly at some abstract level, i.e. instead of capturing the considerations directly, it captures the thing that generates them), and so the process will properly generalize to a new decision.
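To make the "nodes in a decision tree" picture concrete, here is a minimal sketch of amplification as recursive question decomposition. All of the helper names (`can_answer_directly`, `decompose`, `combine`, `answer_directly`) are hypothetical stand-ins for whatever the human/trained agent does at a single node, not part of any actual implementation.

```python
# A minimal sketch of HCH-style amplification as recursive question
# decomposition. The agent at each node only decides how to split the
# question and how to combine subanswers; the detailed work happens in
# the subtrees. All agent methods here are hypothetical placeholders.

def hch_answer(question, agent, depth=3):
    """Answer a question by recursively expanding it into subquestions."""
    if depth == 0 or agent.can_answer_directly(question):
        return agent.answer_directly(question)

    subquestions = agent.decompose(question)
    subanswers = [hch_answer(q, agent, depth - 1) for q in subquestions]
    return agent.combine(question, subanswers)
```

The point of the sketch is just that the top-level node never needs to contain the object-level considerations itself; it only needs to generate the questions whose answers would surface them, which is why the tree can cover a new decision it was never explicitly built for.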
I'm less sure at this point about how well distillation would work. In my understanding, it might require providing some kind of continual supervision (if the trained agent goes into a sufficiently new input domain, it requests more labels on that domain from its overseer), or it might be something Paul expects to fall out of informed oversight + corrigibility?
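For what I mean by "continual supervision", here is a rough sketch: the distilled agent defers to its overseer (the amplified system) on inputs it judges to be outside its training distribution, and keeps the resulting labels for further training. The novelty check and the interfaces are my own hypothetical illustration, not a description of Paul's actual proposal.

```python
# Hypothetical sketch: the distilled model queries its overseer on
# inputs it flags as novel, rather than generalizing on its own.

def act(distilled_model, overseer, x, training_data, novelty_threshold=0.9):
    if distilled_model.novelty_score(x) > novelty_threshold:
        # Sufficiently new input domain: request a label from the
        # overseer and store it for the next round of distillation.
        label = overseer.answer(x)
        training_data.append((x, label))
        return label
    return distilled_model.predict(x)
```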