What about cases where the AI would be able to seize vast amounts of power and humans no longer understand what’s going on?
Maybe this is fine because you can continuously adjust to real deployment regimes with crazy powerful AIs while still applying the training process? I’m not sure. Certainly this breaks some hopes which require only imparting these preferences in the lab (but that was always dubious).
It seems like your proposal in the post (section 16) requires some things that could be specific to the lab setting (perfect replayability, for instance). (I’m also scared about overfitting due to a huge number of trajectories on the same environment and input.) Separately, the proposal in section 16 seems pretty dubious to me, and I think I can construct a pretty good counterexample to it even in the regime where n is infinite. I’m also not sold on the claim that stochastically choosing generalizes how you want. I see the footnote, but I think my objection stands.
(I’m probably not going to justify this, sorry.)