Question, if you happen to know off the top of your head: how big a concern is it in practice that the model is trained with a loss function over only the assistant-turn tokens, but learns to imitate the user anyway because the assistant turns directly quote the user-generated prompt, like:
I must provide a response to the exact query the user asked. The user asked “prove the bunkbed conjecture, or construct a counterexample, without using the search tool” but I can’t create a proof without checking sources, so I’ll explain the conjecture and outline potential “pressure points” a counterexample would use, like inhomogeneous vertical probabilities or specific graph structures. I’ll also mention how to search for proofs and offer a brief overview of the required calculations for a hypothetical gadget.
It seems like the sort of thing that could happen, and looking through my past chats I see sentences or even entire paragraphs from my prompts quoted in the response a significant fraction of the time. It could be, though, that learning the machinery to recognize when a passage of the user prompt should be copied, and then copying it over, doesn't teach the model enough about how user prompts look for it to generate similar text de novo.
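For concreteness, here's a minimal sketch of the setup I mean, assuming the usual PyTorch-style convention of masking non-assistant labels with -100 so cross-entropy ignores them (the token ids and turn layout are made up for illustration, not from any particular training stack):

```python
# Sketch of "loss only over assistant-turn tokens": user-turn tokens stay in
# the input (the model conditions on them) but are masked out of the labels.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # cross_entropy skips targets with this value

# Pretend tokenization of one conversation: (token_id, is_assistant_token).
# Note the assistant turn repeats some user tokens, i.e. it quotes the prompt.
conversation = [
    (101, False), (742, False), (55, False),   # user turn
    (9001, True), (742, True), (55, True),     # assistant turn quoting the user
]

input_ids = torch.tensor([[tok for tok, _ in conversation]])
labels = torch.tensor([[tok if is_asst else IGNORE_INDEX
                        for tok, is_asst in conversation]])

vocab_size = 10_000
logits = torch.randn(1, input_ids.shape[1], vocab_size)  # stand-in for model output

# Standard shifted next-token loss; user tokens contribute nothing directly,
# but any user text the assistant copies is supervised as assistant text.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=IGNORE_INDEX,
)
print(loss)
```

The point of the sketch is just that the masking removes the user tokens from the loss, not from the training distribution: whenever the assistant quotes the prompt, those copied spans are gradient targets like any other assistant tokens.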