Cool work. I wonder if any recent research has tried to train LLMs (perhaps via RL) on deception games in which any tokens (including CoT) generated by each player are visible to all other players.
It will be useful to see if LLMs can hide their deception from monitors over extended token sequences and what strategies they come up with to achieve that (eg. steganography).
Cool work. I wonder if any recent research has tried to train LLMs (perhaps via RL) on deception games in which any tokens (including CoT) generated by each player are visible to all other players.
It will be useful to see if LLMs can hide their deception from monitors over extended token sequences and what strategies they come up with to achieve that (eg. steganography).