evhub comments on Discovering Language Model Behaviors with Model-Written Evaluations

evhub 27 Dec 2022 3:57 UTC
LW: 10 AF: 7
The fact that large models interpret the “HHH Assistant” as such a character is interesting

Yep, that’s the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic. Furthermore, I would say the key result is that bigger models and those trained with more RLHF interpret the “HHH Assistant” character as more agentic. I think that this is concerning because:
1. because of what it implies about capabilities: these are exactly the sort of capabilities that are necessary to be deceptive; and
2. because I think one of the best ways we can try to use these sorts of models safely is by trying to get them not to simulate a very agentic character, and this is evidence that current approaches are very much not doing that.
- cfoster0 27 Dec 2022 4:39 UTC
  6 points
  Yep, that’s the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic.
  [...]
  I think one of the best ways we can try to use these sorts of models safely is by trying to get them not to simulate a very agentic character, and this is evidence that current approaches are very much not doing that.
  I want to flag that I don’t know what it would mean for a “helpful, honest, harmless” assistant / character to be non-agentic (or for it to be no more agentic than the pretrained LM it was initialized from).
  EDIT: this applies to the OP and the originating comment in this thread as well, in addition to the parent comment
  - Roman Leventov 27 Dec 2022 7:55 UTC
    2 points
    Agreed. To be consistently “helpful, honest, and harmless”, an LLM should somehow “keep this in the back of its mind” when it assists the person, or else it risks violating these desiderata.
    In DNN LLMs, “keeping something in the back of the mind” is equivalent to activating the corresponding feature (of “HHH assistant”, in this case) during most inferences, which is equivalent to self-awareness, self-evidencing, goal-directedness, and agency in a narrow sense (these are all synonyms). See my reply to nostalgebraist for more details.