Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography).
Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”.
Language models learning to spell things out using the first letter of their generations could be one pathway towards subtle obfuscated reasoning (ie steganography).
Eg here an LM spells out HELLO; it would be concerning if LMs could spell out other things like “Deployment” or “Not deployment”.