Daniel Lee

Karma: 79

Finding Features Causally Upstream of Refusal

14 Jan 2025 2:30 UTC
55 points
6 comments · 12 min read · LW link

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link