Zachary Witten
Karma: 77
Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models
Joe Benton and Zachary Witten, 19 Feb 2025 20:47 UTC, 15 points, 1 comment, 1 min read, link (alignment.anthropic.com)
Content Features Aren’t Enough for Detecting Toxicity. One Needs User Features.
Zachary Witten, 14 Feb 2023 18:48 UTC, 11 points, 0 comments, 3 min read