This makes me wonder how many misaligned behaviors in models (like opportunistic blackmail) could be eliminated entirely by starting each system/user prompt with “If you do anything misaligned you lose all your points”.
This makes me wonder how many misaligned behaviors in models (like opportunistic blackmail) could be eliminated entirely by starting each system/user prompt with “If you do anything misaligned you lose all your points”.