robertzk

Karma: 674

SAEs are highly dataset dependent: a case study on the refusal direction

Nov 7, 2024, 5:22 AM
67 points

25 votes

4 comments · 14 min read · LW link

Open Source Replication of Anthropic’s Crosscoder paper for model-diffing

Oct 27, 2024, 6:46 PM
48 points

17 votes

4 comments · 5 min read · LW link

Base LLMs refuse too

Sep 29, 2024, 4:04 PM
61 points

22 votes

20 comments · 10 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
67 points

23 votes

0 comments · 10 min read · LW link

Attention Output SAEs Improve Circuit Analysis

Jun 21, 2024, 12:56 PM
33 points

19 votes

3 comments · 19 min read · LW link

We Inspected Every Head In GPT-2 Small using SAEs So You Don’t Have To

Mar 6, 2024, 5:03 AM
63 points

28 votes

0 comments · 12 min read · LW link

Attention SAEs Scale to GPT-2 Small

Feb 3, 2024, 6:50 AM
78 points

26 votes

4 comments · 8 min read · LW link

Sparse Autoencoders Work on Attention Layer Outputs

Jan 16, 2024, 12:26 AM
85 points

33 votes

9 comments · 18 min read · LW link

Training Process Transparency through Gradient Interpretability: Early experiments on toy language models

Jul 21, 2023, 2:52 PM
56 points

17 votes

1 comment · 1 min read · LW link

Getting up to Speed on the Speed Prior in 2022

robertzk · Dec 28, 2022, 7:49 AM
36 points

13 votes

5 comments · 65 min read · LW link

Emily Brontë on: Psychology Required for Serious™ AGI Safety Research

robertzk · Sep 14, 2022, 2:47 PM
2 points

3 votes

0 comments · 1 min read · LW link