micahcarroll
Karma: 210
https://micahcarroll.github.io/
Paper: Prompt Optimization Makes Misalignment Legible
Caleb Biddulph and micahcarroll
12 Feb 2026 19:45 UTC, 51 points, 7 comments, 10 min read
OpenAI: Sidestepping Evaluation Awareness and Anticipating Misalignment with Production Evaluations
Marcus Williams and micahcarroll
18 Dec 2025 22:55 UTC, 25 points, 1 comment, 1 min read
(alignment.openai.com)
Is the evidence in “Language Models Learn to Mislead Humans via RLHF” valid?
Aaryan Chandna, Lukas Fluri and micahcarroll
1 Dec 2025 6:50 UTC, 35 points, 0 comments, 19 min read
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser and Brendan Murphy
7 Nov 2024 15:39 UTC, 51 points, 7 comments, 11 min read