jenny

Karma: 388

Stress Testing Deliberative Alignment for Anti-Scheming Training

Sep 17, 2025, 4:59 PM
124 points

37 votes

Overall karma indicates overall quality.

13 comments · 1 min read · LW link
(antischeming.ai)

jenny’s Shortform

jenny · Jan 24, 2025, 8:09 PM
4 points

1 vote

0 comments · 1 min read · LW link

Attributing to interactions with GCPD and GWPD

jenny · Oct 11, 2023, 3:06 PM
20 points

7 votes

0 comments · 6 min read · LW link

Impact stories for model internals: an exercise for interpretability researchers

jenny · Sep 25, 2023, 11:15 PM
29 points

8 votes

3 comments · 7 min read · LW link

Causal scrubbing: results on induction heads

Dec 3, 2022, 12:59 AM
34 points

12 votes

1 comment · 17 min read · LW link

Causal scrubbing: results on a paren balance checker

Dec 3, 2022, 12:59 AM
34 points

12 votes

2 comments · 30 min read · LW link

Causal scrubbing: Appendix

Dec 3, 2022, 12:58 AM
18 points

7 votes

4 comments · 20 min read · LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Dec 3, 2022, 12:58 AM
206 points

76 votes

35 comments · 20 min read · LW link · 1 review