jenny

Karma: 400

Stress Testing Deliberative Alignment for Anti-Scheming Training

17 Sep 2025 16:59 UTC
126 points
19 comments · 1 min read · LW link
(antischeming.ai)

jenny’s Shortform

jenny · 24 Jan 2025 20:09 UTC
4 points
0 comments · 1 min read · LW link

Attributing to interactions with GCPD and GWPD

jenny · 11 Oct 2023 15:06 UTC
20 points
0 comments · 6 min read · LW link

Impact stories for model internals: an exercise for interpretability researchers

jenny · 25 Sep 2023 23:15 UTC
29 points
3 comments · 7 min read · LW link

Causal scrubbing: results on induction heads

3 Dec 2022 0:59 UTC
34 points
1 comment · 17 min read · LW link

Causal scrubbing: results on a paren balance checker

3 Dec 2022 0:59 UTC
34 points
2 comments · 30 min read · LW link

Causal scrubbing: Appendix

3 Dec 2022 0:58 UTC
18 points
4 comments · 20 min read · LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

3 Dec 2022 0:58 UTC
206 points
35 comments · 20 min read · LW link · 1 review