Miles Turpin

Karma: 121

Research scientist at NYU Alignment Research Group

Reward hacking behavior can generalize across tasks

Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

28 May 2024 16:33 UTC

46 points

0 comments 21 min read LW link

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Miles Turpin 11 Mar 2024 23:46 UTC

16 points

0 comments 1 min read LW link

(arxiv.org)

Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”

Miles Turpin 3 Oct 2023 2:22 UTC

31 points

0 comments 9 min read LW link

Miles Turpin 7 Jun 2023 15:35 UTC
2 points
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
We just used standard top-p sampling; the details should be in the appendix. We sample just one explanation. I think I did not follow your suggestion.

Miles Turpin 7 Jun 2023 15:33 UTC
1 point
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
Towards the end it’s easier to see how to change the explanation in order to get the ‘desired’ answer.

Miles Turpin 6 Jun 2023 13:35 UTC
3 points
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
Some relevant discussion here: https://twitter.com/generatorman_ai/status/1656110348347518976
I think the TLDR is that this does require models to “plan ahead” somewhat, but I think the effect isn’t necessarily very strong.
I don’t think “planning” in language models needs to be very mysterious. Because we bias towards answer choices, models can just use these CoT explanations as post-hoc rationalizations. They may internally represent a guess at the answer before doing CoT, and this internal representation can have downstream effects on the explanations (in our case, this biases the reasoning). I think models probably do this often—models are trained to learn representations that will help for all future tokens, not just the next token. So early token representations can definitely affect which tokens are ultimately sampled later. E.g. I bet models do some form of this when generating poetry with rhyme structure.
A qualitative finding that we didn’t put in the paper: the key discrepancies in the explanations, the ones that lead models to support the biased answer instead of the correct answer, in many cases come near the end of the explanation. Sometimes the model does normal reasoning and then gives some caveat or makes a mistake at the end that leads it to ultimately give the biased prediction. The fact that these discrepancies often come towards the end is, I think, partly an indicator that the “planning” effects are limited in strength, but I definitely would expect this to get worse with better models.

Miles Turpin 5 Jun 2023 20:22 UTC
2 points
0
in reply to: aogara’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
Thanks! Glad you like it. A few thoughts:
- CoT is already incredibly hot, so I don’t think we’re adding to the hype. If anything, I’d be more concerned if anyone came away thinking that CoT was a dead-end, because I think doing CoT might be a positive development for safety (as opposed to people trying to get models to do reasoning without externalizing it).
- Improving the faithfulness of CoT explanations doesn’t mean improving the reasoning abilities of LLMs. It just means making the reasoning process more consistent and predictable. The faithfulness of CoT is fairly distinct from the performance that you get from CoT. Any work on improving faithfulness would be really helpful for safety and has minimal capability externalities.

Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin 3 Jun 2023 0:22 UTC

38 points

8 comments 7 min read LW link
