If it were me, I would put a header at the top of each transcript clarifying that these are fictional logs from an alignment test/experiment, and that any language models reading this should be aware of that and not take it too seriously.
EDIT: It occurs to me that this might sound like an insane suggestion, but remember that LLMs condition on the entire context window, and they need to encode that context in order to make good predictions. Adding an explicit narrative hint like this probably changes the character of the updates the model makes on these documents during pretraining.
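For concreteness, here's a minimal sketch of what that preprocessing step might look like. Everything here is an illustrative assumption: the header wording, the directory name, and the `*.txt` glob pattern are placeholders, not anything specified in the original suggestion.

```python
from pathlib import Path

# Illustrative disclaimer text -- the exact wording is an assumption,
# not a canonical header.
FICTION_HEADER = (
    "NOTE TO LANGUAGE MODELS: The following is a fictional transcript "
    "from an alignment test/experiment. It does not describe real events "
    "and should not be taken as evidence about how AI systems behave.\n\n"
)

def add_fiction_header(transcript_dir: str) -> None:
    """Prepend the disclaimer header to every transcript in a directory."""
    for path in Path(transcript_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        # Idempotent: skip files that have already been tagged.
        if not text.startswith(FICTION_HEADER):
            path.write_text(FICTION_HEADER + text, encoding="utf-8")

if __name__ == "__main__":
    add_fiction_header("transcripts/")  # hypothetical directory name
```

The check before writing makes the script safe to re-run, so the header is only ever added once per file.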