Interesting point, which I broadly agree with. I do think, however, that this post has in some sense over-updated on recent developments around agentic LLMs and the non-dangers of foundation models. Even 3-6 months ago, it was unclear in the intellectual zeitgeist whether AutoGPT-style agentic LLM wrappers were the main threat; people were primarily worried about foundation models being directly dangerous. It now seems clearer that, at least at current capability levels, foundation models are not directly goal-seeking, although adding agency on top of them is relatively straightforward. This may also change in the future, for instance if we were to do direct goal-driven RL training of the base models to create agents that way; this would make direct alignment and interpretability of base models still necessary for safety.
I agree the zeitgeist has changed, but I think some people (or at least Nate and Eliezer in particular) have always been more concerned about more agent-like systems, along the lines of Mu Zero. For example, in the 2021 MIRI conversations here:
I do not quite think that gradient descent on Stack More Layers alone—as used by OpenAI for GPT-3, say, and as opposed to Deepmind which builds more complex artifacts like Mu Zero or AlphaFold 2—is liable to be the first path taken to AGI.
There may be different cognitive technology that could follow a path like that. Gradient descent follows a path a bit relatively more in that direction along that axis—providing that you deal in systems that are giant layer cakes of transformers and that’s your whole input-output relationship; matters are different if we’re talking about Mu Zero instead of GPT-3.
Deep Deceptiveness is more recent, but it’s another example of a carefully non-specific argument that doesn’t factor through any current DL-paradigm methods, and is consistent with the kind of thing Nate and Eliezer have always been saying.
I think recent developments with LLMs have caused some other people to update towards LLMs alone being dangerous, which might be true, but if so it doesn’t imply that more complex systems are not even more dangerous.