And I’m not Daniel K., but I do want to respond to you here, Ryan. The world I foresee is one in which huge, tempting power gains become obviously available to anyone willing to engage in something like RL-training their personal LLM agent (or some other method of instilling additional goal-pursuing power into it). I expect that at some point in the future the tech will change, this opportunity will become widely available, and some early adopters will begin benefiting in highly visible ways. If that future comes to pass, then I expect the world to go ‘off the rails’, because these LLMs will have correlated-but-not-equivalent goals and will become increasingly powerful (since one of the goals they get set will be to create more powerful agents).
I don’t think that’s the only way things go badly in the future, but I think it’s an important danger we need to be on guard against. Thus, I think a crux between you and me is that I believe there is strong reason to think the ‘if we did a bunch of RL’ scenario is actually quite likely. I believe it is inherently an attractor state.
To clarify, I don’t think that LLM agents are necessarily or obviously safe. I was just trying to argue that it’s plausible they could achieve long-term objectives while also not having “wanting” in the sense necessary for (some) AI risk arguments to go through. (Edited my earlier comment to make this clearer.)
Thanks for the clarification!