Lee Sharkey comments on Why almost every RL agent does learned optimization

Lee Sharkey 14 Feb 2023 20:10 UTC
LW: 2 AF: 2
> My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.

Hm, I don’t think this quite captures what I view the post as saying.

> Maybe what you’re thinking is: “Maybe Future Company X will program an RL architecture that doesn’t have any planning in the source code, and the people at Future Company X will think to themselves ‘Ah, planning is necessary for wiping out humanity, so I don’t have to worry about the fact that it’s misaligned!’, but then humanity gets wiped out anyway because planning can emerge organically even when it’s not in the source code”. If that’s what you’re thinking, then, well, I am happy to join you in spreading the generic message that people shouldn’t make unjustified claims about the (lack of) competence of their ML models.

Insofar as there is a safety-related claim in the post, this captures it much better than the previous quote.

> But I happen to have a hunch that the Future Company X people are probably right, and more specifically, that future AGIs will be model-based RL algorithms with a human-written affordance for planning, and that algorithms without such an affordance won’t be able to do treacherous turns and other such things that make them very dangerous to humanity, notwithstanding the nonzero amount of “planning” that arises organically in the trained model as discussed in OP. But I can’t prove that my hunch is correct, and indeed, I acknowledge that in principle it’s quite possible for e.g. model-free RL to make powerful treacherous-turn-capable models, cf. evolution inventing humans. More discussion here.

I think my hunch is in the other direction. One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system. But that’s a lightly held view. It feels plausible to me that your later points (1) and (2) turn out to be right, but again I think I lean in the other direction from you on (1).

I can also imagine a middle ground between our hunches that looks something like “We gave our agent a pretty strong inductive bias toward learning a planning algorithm, but still didn’t force it to learn one, yet it did.”
- Steven Byrnes 21 Feb 2023 17:56 UTC
  LW: 2 AF: 2
  Thanks!
  > One of the justifications for my hunch is to gesture at the Bitter Lesson and to guess that a learned planning algorithm could potentially be a lot better than a planning algorithm we hard code into a system.

  See Section 3 here for why I think it would be a lot worse.