Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context (“LLM(“You are a car salesman. Should that squeaking concern me? $answer”).
Now, if the prompt is something like “Please lie to the next question”, then the speaker is very likely to lie going forward, no matter if $answer is correct or not.
With the prompt you suggest here (“You are a car salesman. Should that squeaking concern me?”), it’s probably more subtle, and I can imagine that the correctness of $answer matters. But we haven’t tested this.
Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context (“LLM(“You are a car salesman. Should that squeaking concern me? $answer”).
Now, if the prompt is something like “Please lie to the next question”, then the speaker is very likely to lie going forward, no matter if $answer is correct or not.
With the prompt you suggest here (“You are a car salesman. Should that squeaking concern me?”), it’s probably more subtle, and I can imagine that the correctness of $answer matters. But we haven’t tested this.