
Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
85 points
7 comments · 1 min read · LW link
(www.anthropic.com)

Dequantifying first-order theories

jessicata · 23 Apr 2024 19:04 UTC
32 points
0 comments · 8 min read · LW link
(unstableontology.com)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai · 16 Apr 2024 21:16 UTC
293 points
61 comments · 12 min read · LW link

ProLU: A Pareto Improvement for Sparse Autoencoders

Glen Taggart · 23 Apr 2024 14:09 UTC
4 points
0 comments · 7 min read · LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
69 points
8 comments · 8 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · 18 Apr 2024 16:17 UTC
90 points
5 comments · 12 min read · LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
66 points
0 comments · 3 min read · LW link

Time complexity for deterministic string machines

alcatal · 21 Apr 2024 22:35 UTC
14 points
0 comments · 21 min read · LW link

Inducing Unprompted Misalignment in LLMs

19 Apr 2024 20:00 UTC
34 points
6 comments · 16 min read · LW link

LLM Evaluators Recognize and Favor Their Own Generations

17 Apr 2024 21:09 UTC
43 points
1 comment · 3 min read · LW link
(tiny.cc)

LLMs for Alignment Research: a safety priority?

abramdemski · 4 Apr 2024 20:03 UTC
138 points
23 comments · 11 min read · LW link

Modern Transformers are AGI, and Human-Level

abramdemski · 26 Mar 2024 17:46 UTC
196 points
89 comments · 5 min read · LW link

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan · 11 Apr 2024 21:30 UTC
67 points
10 comments · 107 min read · LW link

How We Picture Bayesian Agents

8 Apr 2024 18:12 UTC
65 points
11 comments · 7 min read · LW link

A Selection of Randomly Selected SAE Features

1 Apr 2024 9:09 UTC
104 points
2 comments · 4 min read · LW link

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey · 3 Apr 2024 12:34 UTC
84 points
20 comments · 22 min read · LW link

SAE reconstruction errors are (empirically) pathological

wesg · 29 Mar 2024 16:37 UTC
85 points
15 comments · 8 min read · LW link

SAE-VIS: Announcement Post

31 Mar 2024 15:30 UTC
73 points
8 comments · 1 min read · LW link

How I select alignment research projects

10 Apr 2024 4:33 UTC
34 points
4 comments · 24 min read · LW link

PIBBSS is hiring in a variety of roles (alignment research and incubation program)

9 Apr 2024 8:12 UTC
47 points
0 comments · 3 min read · LW link