williawa

Karma: 859

GPT5.5 Released

williawa · 23 Apr 2026 18:44 UTC

11 points

4 comments · 1 min read · LW link

(openai.com)

Excerpts and Notes on Mythos Model Card

williawa · 8 Apr 2026 15:10 UTC

45 points

0 comments · 13 min read · LW link

Propositional Alignment

williawa · 30 Mar 2026 13:50 UTC

7 points

0 comments · 2 min read · LW link

The Future of Aligning Deep Learning systems will probably look like “training on interp”

williawa · 20 Mar 2026 23:06 UTC

28 points

7 comments · 4 min read · LW link

Observations from Running an Agent Collective

williawa · 24 Feb 2026 15:34 UTC

45 points

2 comments · 10 min read · LW link

Some evidence against the idea strange CoT stems from incentives to compress language

williawa · 10 Dec 2025 22:43 UTC

17 points

0 comments · 2 min read · LW link

Models not making it clear when they’re roleplaying seems like a fairly big issue

williawa · 21 Nov 2025 20:23 UTC

16 points

3 comments · 6 min read · LW link

a sketch of how we might go about getting basins of corrigibility from RL

williawa · 14 Nov 2025 22:10 UTC

10 points

0 comments · 4 min read · LW link

Thoughts About how RLHF and Related “Prosaic” Approaches Could be Used to Create Robustly Aligned AIs.

williawa · 23 Aug 2025 21:05 UTC

10 points

14 comments · 5 min read · LW link

williawa’s Shortform

williawa · 1 Apr 2025 13:17 UTC

3 points

59 comments · 1 min read · LW link

Bergen – ACX Meetups Everywhere Spring 2025

williawa · 25 Mar 2025 23:49 UTC

2 points

0 comments · 1 min read · LW link