Well, it looks to me like AI will soon understand our values at least as well as we do. I think it's far more likely that AI goes wrong by understanding completely what we want and not wanting to do it than by the paperclip route.
That is the paperclip route. A superintelligent paperclip optimizer understands what we want, because it is superintelligent, but it wants to make “paperclips” instead.
Yes, but the question of whether pretrained LLMs have good representations of our values and/or our preferences and the concept of deference/obedience is still quite important for whether they become aligned. If they don't, then aligning them via fine-tuning after the fact seems quite hard. If they do, it seems pretty plausible to me that e.g. RLHF fine-tuning or something like Anthropic's constitutional AI finds the solution of “link the values/obedience representations to the output in a way that causes aligned behavior,” because this is simple and attains lower loss than misaligned paths. This in turn is because, in order for the model to be misaligned and still attain low loss, it must be deceptively aligned, and deceptive alignment requires a combination of good situational awareness, a fully consequentialist objective, and high-quality planning/deception skills.
What does GPT want?
I don’t know.
My model of foundational LLMs, before tuning and prompting, is that they want to predict the next token, assuming that the token stream is taken from the hypothetical set that their training data is sampled from. Their behavior out of distribution is not well-defined in this model.
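To make “wanting to predict the next token” concrete, here is a minimal sketch of the pretraining objective (my own illustration, assuming PyTorch and a toy stand-in for a real transformer): the only thing training pushes on is cross-entropy over next tokens drawn from the training distribution, which is exactly why behavior off that distribution is left undefined in this model.

```python
# Minimal sketch of the pretraining objective (toy model, not a real LLM).
# "Wanting to predict the next token" = minimizing this cross-entropy on
# token streams sampled from the training distribution; nothing here pins
# down behavior on streams outside that distribution.
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, d_model),
    torch.nn.Linear(d_model, vocab_size),
)  # stand-in for a real transformer

tokens = torch.randint(0, vocab_size, (1, 16))   # a stream "from the training distribution"
logits = model(tokens[:, :-1])                   # predict each next token
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
loss.backward()                                  # the only "want": push this number down
```

Nothing in that loop refers to what the tokens mean, so any further “wants” have to come from tuning and prompting afterward.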
My model of typical tuned and prompted LLMs is that they mostly want to do the thing they have been tuned and prompted to do, but also have additional wants that cause them to diverge in unpredictable ways.
They don’t “want” anything and thinking of them as having wants leads to confused thinking.