williawa
Karma: 404
Models not making it clear when they’re roleplaying seems like a fairly big issue
williawa, 21 Nov 2025 20:23 UTC, 16 points, 3 comments, 6 min read
a sketch of how we might go about getting basins of corrigibility from RL
williawa, 14 Nov 2025 22:10 UTC, 9 points, 0 comments, 4 min read
Thoughts About how RLHF and Related “Prosaic” Approaches Could be Used to Create Robustly Aligned AIs.
williawa, 23 Aug 2025 21:05 UTC, 10 points, 14 comments, 4 min read
williawa’s Shortform
williawa, 1 Apr 2025 13:17 UTC, 3 points, 41 comments, 1 min read
Bergen – ACX Meetups Everywhere Spring 2025
williawa, 25 Mar 2025 23:49 UTC, 2 points, 0 comments, 1 min read