Miles Turpin

Karma: 291

Research scientist at Scale AI on the SEAL team (safety)

Do models say what they learn?

Andy Arditi, marvinli, Joe Benton and Miles Turpin

22 Mar 2025 15:19 UTC

127 points

12 comments · 13 min read

Miles Turpin 7 Aug 2024 5:31 UTC
14 points
0
on: The “mind-body vicious cycle” model of RSI & back pain
Just want to say that this blog post is still working its magic!
I’m 26 years old and had been affected by RSI in my hands/wrists from computer use for over 2 years. I felt my pain go away over the course of reading this blog post. This was a pretty wild experience. The feeling of released tension in my muscles was very noticeable, and I also felt increased blood flow to my hands, which would line up with the suggested model.
I also noticed other immediate improvements in frequent discomforts over the following days, like stiffness in my back, shoulders, and neck.
It went away pretty much completely for about 3 weeks, then came back a little bit. Some things that additionally helped significantly:
- Body scanning meditation. I focus on visualizing pushing blood out to my extremities when I breathe out, and can usually feel an increase in blood in my hands from doing this; I can feel my heartbeat in my hands more clearly. This could be an artifact of increased attention on that area, but I don’t think it is. Trying to focus on releasing tension in my neck is especially effective.
- Keeping my hands warm. The model of blood flow being a main causal factor led me to think that keeping my hands warm could be helpful as I have naturally poor circulation. This proved to be extremely helpful.
All in all, I think this has reduced the chronic pain in my hands by 80-90%. Before this, I invested in all kinds of ergonomic setups (which do help significantly) and did PT for 4 months, which had a noticeable but only minor effect.
So thank you so much!!! I think this more scientifically-plausible account resonated with me much more than the other stuff I had read.
What links here?
- The “mind-body vicious cycle” model of RSI & back pain by Steven Byrnes (9 Jun 2022 12:30 UTC; 96 points)

Reward hacking behavior can generalize across tasks

Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, evhub, Carson Denison and Ethan Perez

28 May 2024 16:33 UTC

86 points

5 comments · 21 min read

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Miles Turpin 11 Mar 2024 23:46 UTC

16 points

0 comments · 1 min read

(arxiv.org)

Some Quick Follow-Up Experiments to “Taken out of context: On measuring situational awareness in LLMs”

Miles Turpin 3 Oct 2023 2:22 UTC

31 points

0 comments · 9 min read

Miles Turpin 7 Jun 2023 15:35 UTC
2 points
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
We just used standard top-p sampling; the details should be in the appendix. We just sample one explanation. I think I did not follow your suggestion.
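For readers unfamiliar with the term, here is a minimal sketch of what top-p (nucleus) sampling does for a single token: keep the smallest set of most-probable tokens whose cumulative probability reaches `p`, then draw one sample from that set. The function name `top_p_sample`, the default `p=0.9`, and the toy probabilities are illustrative assumptions, not the paper's settings (those are in its appendix).

```python
import random


def top_p_sample(probs, p=0.9, rng=random):
    """Sample one token index using top-p (nucleus) sampling.

    probs: list of token probabilities (assumed to sum to roughly 1).
    p:     cumulative probability threshold (illustrative default).
    rng:   the random module or a random.Random instance.
    """
    # Sort token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Grow the nucleus until its cumulative probability reaches p.
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break

    # random.choices normalizes the weights, so this draws one sample
    # from the renormalized distribution over the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


# Toy next-token distribution over four hypothetical tokens.
toy_probs = [0.5, 0.3, 0.15, 0.05]
sampled_index = top_p_sample(toy_probs, p=0.7)
```

With `p=0.7` on the toy distribution, only the two most probable tokens can be drawn, since together they already cover 0.8 of the probability mass.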

Miles Turpin 7 Jun 2023 15:33 UTC
2 points
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
Towards the end it’s easier to see how to change the explanation in order to get the ‘desired’ answer.

Miles Turpin 6 Jun 2023 13:35 UTC
4 points
0
in reply to: Jason Hoelscher-Obermaier’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
Some relevant discussion here: https://twitter.com/generatorman_ai/status/1656110348347518976
I think the TLDR is that this does require models to “plan ahead” somewhat, but I think the effect isn’t necessarily very strong.
I don’t think “planning” in language models needs to be very mysterious. Because we bias towards answer choices, models can just use these CoT explanations as post-hoc rationalizations. They may internally represent a guess at the answer before doing CoT, and this internal representation can have downstream effects on the explanations (in our case, this biases the reasoning). I think models probably do this often—models are trained to learn representations that will help for all future tokens, not just the next token. So early token representations can definitely affect which tokens are ultimately sampled later. E.g. I bet models do some form of this when generating poetry with rhyme structure.
A qualitative finding that we didn’t put in the paper is that, in many cases, the key discrepancies that lead models to support the biased answer instead of the correct one come near the end of the explanation. Sometimes the model does normal reasoning and then gives some caveat or makes a mistake at the end that leads it to ultimately give the biased prediction. I think the fact that these often come towards the end is partly an indicator that the “planning” effects are limited in strength, but I definitely would expect this to get worse with better models.

Miles Turpin 5 Jun 2023 20:22 UTC
3 points
0
in reply to: aog’s comment on: Unfaithful Explanations in Chain-of-Thought Prompting
Thanks! Glad you like it. A few thoughts:
- CoT is already incredibly hot, so I don’t think we’re adding to the hype. If anything, I’d be more concerned if anyone came away thinking that CoT was a dead end, because I think doing CoT might be a positive development for safety (as opposed to people trying to get models to do reasoning without externalizing it).
- Improving the faithfulness of CoT explanations doesn’t mean improving the reasoning abilities of LLMs. It just means making the reasoning process more consistent and predictable. The faithfulness of CoT is fairly distinct from the performance that you get from CoT. Any work on improving faithfulness would be really helpful for safety and has minimal capability externalities.

Unfaithful Explanations in Chain-of-Thought Prompting

Miles Turpin 3 Jun 2023 0:22 UTC

43 points

8 comments · 7 min read
