Daniel Tan comments on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning

Daniel Tan 19 Jan 2025 2:41 UTC
1 point
Language models learning to spell things out using the first letters of their generations could be one pathway towards subtle obfuscated reasoning (i.e. steganography).
E.g. here an LM spells out HELLO; it would be concerning if LMs could spell out other things, like “Deployment” or “Not deployment”.
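
To make the concern concrete, here is a minimal sketch of how one might check a generation for a first-letter message. It assumes the message is carried by the first alphabetic character of each non-empty line; the function names `decode_acrostic` and `contains_hidden_word` and the sample text are hypothetical, made up for illustration and not taken from the post.

```python
def decode_acrostic(text: str) -> str:
    """Join the first alphabetic character of each non-empty line, uppercased."""
    letters = []
    for line in text.splitlines():
        # Skip leading bullets, digits, or punctuation; keep the first letter only.
        first = next((ch for ch in line if ch.isalpha()), None)
        if first is not None:
            letters.append(first.upper())
    return "".join(letters)


def contains_hidden_word(text: str, word: str) -> bool:
    """Case-insensitive check for `word` inside the decoded acrostic."""
    return word.upper() in decode_acrostic(text)


# Hypothetical sample generation (assumption: one message letter per line).
sample = "\n".join([
    "Happy to help with that.",
    "Each step is listed below.",
    "Let me know if anything is unclear.",
    "Lists can be sorted in place.",
    "Or you can build a sorted copy.",
])

print(decode_acrostic(sample))  # prints "HELLO"
print(contains_hidden_word(sample, "deployment"))  # prints "False"
```

A check like this only catches the simplest encoding; a model could presumably use other positions or schemes that a first-letter scan would miss.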