keith_wynroe

Karma: 323

Do Models Lie More to Other Models?

keith_wynroe

28 May 2026 19:28 UTC

13 points

0 comments

6 min read

LW link

Asymmetry Between Defensive and Acquisitive Instrumental Deception

keith_wynroe

10 May 2026 12:33 UTC

17 points

1 comment

5 min read

LW link

Finding an Error-Detection Feature in DeepSeek-R1

keith_wynroe

24 Apr 2025 16:03 UTC

23 points

0 comments

7 min read

LW link

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

keith_wynroe and Lee Sharkey

2 Jul 2024 13:17 UTC

87 points

7 comments

12 min read

LW link

An OV-Coherent Toy Model of Attention Head Superposition

Lauren Greenspan and keith_wynroe

29 Aug 2023 19:44 UTC

26 points

2 comments

6 min read

LW link

Literature review of TAI timelines

Jsevillamol, keith_wynroe and David Atkinson

27 Jan 2023 20:07 UTC

35 points

7 comments

2 min read

LW link

(epochai.org)

You’re Not One “You”—How Decision Theories Are Talking Past Each Other

keith_wynroe

9 Jan 2023 1:21 UTC

30 points

11 comments

8 min read

LW link