micahcarroll

Karma: 164

https://micahcarroll.github.io/

OpenAI: Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations

18 Dec 2025 22:55 UTC
25 points
0 comments · 1 min read · LW link
(alignment.openai.com)

Is the evidence in “Language Models Learn to Mislead Humans via RLHF” valid?

1 Dec 2025 6:50 UTC
35 points
0 comments · 19 min read · LW link

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

7 Nov 2024 15:39 UTC
51 points
7 comments · 11 min read · LW link