Aaryan Chandna (Karma: 28)
Is the evidence in "Language Models Learn to Mislead Humans via RLHF" valid?
by Aaryan Chandna, Lukas Fluri and micahcarroll · 1 Dec 2025 6:50 UTC · 35 points · 0 comments · 19 min read · LW link
Aaryan Chandna · 30 Nov 2025 18:43 UTC · 1 point · on: Natural emergent misalignment from reward hacking in production RL
Interesting, would be cool to know which base model was used!