Intuitively, I would expect that such LLM outputs (especially ones labeled as LLM outputs, although to some extent models can recognize their own output) are too few to provide a comprehensive picture of baseline behavior.
On the other hand, maybe it's comparing against documents in general and recognizing that it's rare for them to follow an acrostic form? That seems at least somewhat plausible, though perhaps less so for the experiments in 'Language Models Can Articulate Their Implicit Goals', since those behaviors are less unusual than producing acrostics; e.g., the training data presumably contains a range of risk-seeking and risk-averse behaviors.