ryan_greenblatt comments on Florian_Dietz’s Shortform

ryan_greenblatt 26 May 2025 13:19 UTC
14 points
11
Smart LLMs can often/typically tell from my understanding, but I don’t have a citation.

In cases where the response you want is somewhat plausible, you can work around this issue by rejection sampling from the model. I found that alignment faking prefills seemed to work a bit better when you use an actual alignment faking output from the model (slightly edited as needed).

It’s unclear how it alters the AI’s behavior (at the current level of capability) when you do this.

Naively, an AI can’t use just the binary “did they put words in my mouth” to determine if it is in training/testing because it also happens in deployment. But it might learn something about its situation based on this.

(If you’re training to imitate the relevant output, then it quickly becomes an output the AI would plausibly generate (with some caveats around AIs recognizing their own errors). If the “words in the mouth” are just inputs that the AI isn’t trained to imitate but which are used in as inputs for training, than it will ongoingly be able to tell.)