Do you count avoiding reward-on-the-episode-seekers as part of step 2 or step 3?
By default step 3 (reward-on-the-episode seekers aren't directly optimizing to make your future efforts at studying their generalization fail in the direction of AI takeover), but I do think the line here can get a bit blurry.