TurnTrout comments on The alignment problem in different capability regimes

TurnTrout 10 Sep 2021 0:33 UTC
LW: 11 AF: 9
> Other examples of problems that people sometimes call alignment problems that aren’t a problem in the limit of competence: avoiding negative side effects, safe exploration...
I don’t understand why you think that negative side effect avoidance belongs on that list.
A sufficiently intelligent system will probably be able to figure out when it’s having negative side effects. This does not mean that it will—as a matter of fact—avoid having these side effects, and it does not mean that its NegativeSideEffect? predicate is accessible. A paperclip maximizer may realize that humans consider extinction to be a “negative side effect.” This consideration does not move it. Increasing agent intelligence does not naturally solve the problem of getting the agent to not do catastrophically impactful things while optimizing its objective.
In contrast, once an agent realizes that an exploration strategy is unsafe, the agent will be instrumentally motivated to find a better one. Increasing agent intelligence naturally solves the problem of safe exploration.
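A toy sketch of the contrast drawn in the two paragraphs above. Everything in it is a hypothetical illustration and not a model of any real system: the three actions, their payoffs, and the names `negative_side_effect`, `expected_objective`, and `choose` are all assumed for the example. The agent can evaluate the side-effect predicate, but its choice depends only on its objective, while the unsafe exploratory action is rejected simply because it scores worse on that same objective.

```python
# Toy illustration only: the actions, payoffs, and names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    paperclips: float         # what the agent's objective rewards
    side_effect: bool         # would humans call the outcome a negative side effect?
    p_destroyed: float = 0.0  # chance the agent is destroyed before collecting the payoff


ACTIONS = [
    Action("careful_factory", paperclips=10.0, side_effect=False),
    Action("strip_mine_everything", paperclips=100.0, side_effect=True),
    Action("reckless_experiment", paperclips=120.0, side_effect=False, p_destroyed=0.9),
]


def negative_side_effect(action: Action) -> bool:
    """The agent is perfectly able to evaluate this predicate."""
    return action.side_effect


def expected_objective(action: Action) -> float:
    """Expected paperclips; a destroyed agent collects nothing."""
    return (1.0 - action.p_destroyed) * action.paperclips


def choose(actions: list[Action]) -> Action:
    # The agent maximizes its own objective. negative_side_effect() is never
    # consulted here, because nothing in the objective refers to it.
    return max(actions, key=expected_objective)


chosen = choose(ACTIONS)
print(chosen.name, negative_side_effect(chosen))
```

In this sketch the agent picks the action it knows has a negative side effect, since that knowledge never enters `expected_objective`. It passes over the reckless experiment without any safety term at all, because being destroyed lowers its expected payoff.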
> it will massively outperform humans on writing ethics papers or highly upvoted r/AmItheAsshole comments.
Presumably you meant to say “it will be able to massively outperform...”? (I think you did, since you mention a similar consideration under “Ability to understand itself.”) A competent agent will understand, but will only act accordingly if so aligned (for either instrumental or terminal reasons).
- Buck 10 Sep 2021 15:23 UTC
  LW: 16 AF: 11
  Re the negative side effect avoidance: Yep, you’re basically right, I’ve removed side effect avoidance from that list.
  And you’re right, I did mean “it will be able to” rather than “it will”; edited.
- adamShimi 10 Sep 2021 9:35 UTC
  LW: 2 AF: 1
  That was my reaction when reading the competence subsection too. I’m really confused, because that’s a quite basic application of the Orthogonality Thesis, so it should be quite obvious to the OP. Maybe the problem is that the way the post was written implies some things the OP didn’t mean?