abramdemski comments on The case for aligning narrowly superhuman models

abramdemski 14 Mar 2021 23:13 UTC
LW: 4 AF: 3
I do see a post with a formal result, which seems like a direct contradiction to what you’re saying, though I’ll look in more detail.
If you mean to suggest this post has a positive result, then I think you’re just misreading it; the key result is:
The conclusion of this post is the following: if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy, and there is some task for which that machine learning results in deceptive behavior, then there exists a natural task such that the minimal circuit that solves that task also produces deceptive behavior.
which says that, under some assumptions, there exists a task for which the minimal circuit will engage in deceptive behavior (i.e., is a malign inner optimizer).
The comment with a counterexample on the original post is here.
- magfrump 15 Mar 2021 6:14 UTC
  2 points
  I see, I definitely didn’t read that closely enough.