Aaryan Chandna (Karma: 28)
Is the evidence in "Language Models Learn to Mislead Humans via RLHF" valid?
by Aaryan Chandna, Lukas Fluri and micahcarroll · 1 Dec 2025 6:50 UTC · 35 points · 0 comments · 19 min read · LW link
Aaryan Chandna · 30 Nov 2025 18:43 UTC · 1 point · on: Natural emergent misalignment from reward hacking in production RL
Interesting, would be cool to know which base model was used!