Thanks for writing this!
> Paradoxically, when interrogated, it hallucinates a confession to a hidden objective it never actually followed.
When working on lie detection, we found that even the off-the-shelf Llama 70B model falsely confesses to misaligned objectives, such as being trained to sandbag or having a hidden goal (see Appendix I). In general, this model seems to be regularly confused and poor at introspection.
> Changing LoRA rank does not help
It’s interesting that the SSC execution rate never gets beyond 50%, even before SRFT. Do you have a sense for why this is? Also, did you bump alpha up when increasing rank (e.g., with rsLoRA)?
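
To make the alpha question concrete, here is a rough sketch of why it matters. Standard LoRA scales the adapter update by `alpha / rank`, so raising rank at fixed alpha shrinks the effective update, whereas rsLoRA scales by `alpha / sqrt(rank)`. The function names and the numbers below are my own illustration, not anything taken from the post.

```python
import math


def lora_scale(alpha: float, rank: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA (rsLoRA) scaling: alpha / sqrt(rank)."""
    return alpha / math.sqrt(rank)


def alpha_for_constant_scale(base_alpha: float, base_rank: int, new_rank: int) -> float:
    """Alpha that keeps the standard LoRA scale unchanged when rank changes."""
    return base_alpha * new_rank / base_rank


if __name__ == "__main__":
    alpha = 16.0  # illustrative value, not from the post
    for rank in (8, 16, 64, 256):
        print(
            f"rank={rank:>3}  "
            f"lora={lora_scale(alpha, rank):.4f}  "
            f"rslora={rslora_scale(alpha, rank):.4f}"
        )
```

With alpha held fixed, the standard scale drops linearly as rank grows, so a higher-rank run can end up with a much weaker effective update unless alpha is raised too. If you are using Hugging Face PEFT, setting `use_rslora=True` in `LoraConfig` switches to the square-root scaling.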