I made a few edits to this post today, mostly in response to feedback from Ryan and Richard:
Added 2 sentences emphasizing the point that schemers probably won’t be aware of their terminal goal in most contexts. I thought this was clear from the post already, but apparently it wasn’t.
Modified “What factors affect the likelihood of training-gaming?” to emphasize that “sum of proxies” and “reward-seeker” are points on a spectrum. We might get an in-between model where context-dependent drives conflict with higher goals and sometimes “win” even outside the settings they are well-adapted to. I also added a footnote about this to “Characterizing training-gamers and proxy-aligned models”.
Edited the “core claims” section (e.g. softening a claim and adding content).
Changed “reward seeker” to “training-gamer” in a bunch of places where I was using it to refer to both terminal reward seekers and schemers.
Miscellaneous small changes.
Shouldn’t the second one be 1/√k?
Is this meant to say “last token” instead of “past token”?