“What would the Assistant do?” (according to the beliefs of the post-trained LLM simulating the Assistant).
nit: I think you're slightly falling foul of the anthropomorphism you're aiming to avoid here. Rather: 'according to the post-trained LLM's posterior over Assistant personas (further conditioned by the preceding context)'.
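To make that concrete, a minimal sketch (the explicit sum over personas is an idealization I'm introducing for illustration, not something the model literally computes):

$$
P(\text{token} \mid \text{context}) \;=\; \sum_{\pi} P(\text{token} \mid \pi, \text{context})\, P(\pi \mid \text{context}),
$$

so "what the Assistant would do" is a property of the whole posterior $P(\pi \mid \text{context})$ over personas $\pi$, not the belief of any single agent.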
More generally, a license to consider some things somewhat anthropomorphically (i.e. Assistant personas under certain training regimes) needs to be handled very carefully, given well-known human tendencies to over-anthropomorphize!
You may be falling into a similar trap in the AI welfare section (perhaps conflating LM, Assistant, and Assistant-instance beliefs?), but it's pretty tricky territory!