williawa
Karma: 702
Observations from Running an Agent Collective
williawa, 24 Feb 2026 15:34 UTC, 34 points, 1 comment, 10 min read
Some evidence against the idea strange CoT stems from incentives to compress language
williawa, 10 Dec 2025 22:43 UTC, 16 points, 0 comments, 2 min read
Models not making it clear when they’re roleplaying seems like a fairly big issue
williawa, 21 Nov 2025 20:23 UTC, 16 points, 3 comments, 6 min read
a sketch of how we might go about getting basins of corrigibility from RL
williawa, 14 Nov 2025 22:10 UTC, 10 points, 0 comments, 4 min read
Thoughts About how RLHF and Related “Prosaic” Approaches Could be Used to Create Robustly Aligned AIs.
williawa, 23 Aug 2025 21:05 UTC, 10 points, 14 comments, 5 min read
williawa’s Shortform
williawa, 1 Apr 2025 13:17 UTC, 3 points, 53 comments, 1 min read
Bergen – ACX Meetups Everywhere Spring 2025
williawa, 25 Mar 2025 23:49 UTC, 2 points, 0 comments, 1 min read