AXRP Episode 29 - Science of Deep Learning with Vikrant Varma

DanielFilan, 25 Apr 2024 19:10 UTC
13 points
0 comments, 63 min read, LW link

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
25 points
0 comments, 1 min read, LW link
(arxiv.org)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
112 points
13 comments, 1 min read, LW link
(www.anthropic.com)

Dequantifying first-order theories

jessicata, 23 Apr 2024 19:04 UTC
37 points
8 comments, 8 min read, LW link
(unstableontology.com)

ProLU: A Nonlinearity for Sparse Autoencoders

Glen Taggart, 23 Apr 2024 14:09 UTC
29 points
2 comments, 7 min read, LW link

Time complexity for deterministic string machines

alcatal, 21 Apr 2024 22:35 UTC
14 points
0 comments, 21 min read, LW link

Inducing Unprompted Misalignment in LLMs

19 Apr 2024 20:00 UTC
35 points
6 comments, 16 min read, LW link

[Full Post] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
70 points
8 comments, 8 min read, LW link

[Summary] Progress Update #1 from the GDM Mech Interp Team

19 Apr 2024 19:06 UTC
68 points
0 comments, 3 min read, LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks, 18 Apr 2024 16:17 UTC
93 points
5 comments, 12 min read, LW link

LLM Evaluators Recognize and Favor Their Own Generations

17 Apr 2024 21:09 UTC
43 points
1 comment, 3 min read, LW link
(tiny.cc)

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai, 16 Apr 2024 21:16 UTC
301 points
63 comments, 12 min read, LW link

Speedrun ruiner research idea

lukehmiles, 13 Apr 2024 23:42 UTC
4 points
11 comments, 2 min read, LW link

AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt

DanielFilan, 11 Apr 2024 21:30 UTC
67 points
10 comments, 107 min read, LW link

The theory of Proximal Policy Optimisation implementations

salman.mohammadi, 11 Apr 2024 13:00 UTC
3 points
1 comment, 6 min read, LW link
(salmanmohammadi.github.io)

How I select alignment research projects

10 Apr 2024 4:33 UTC
34 points
4 comments, 24 min read, LW link

PIBBSS is hiring in a variety of roles (alignment research and incubation program)

9 Apr 2024 8:12 UTC
47 points
0 comments, 3 min read, LW link

How We Picture Bayesian Agents

8 Apr 2024 18:12 UTC
65 points
11 comments, 7 min read, LW link

Measuring Learned Optimization in Small Transformer Models

J Bostock, 8 Apr 2024 14:41 UTC
22 points
0 comments, 11 min read, LW link

Measuring Predictability of Persona Evaluations

6 Apr 2024 8:46 UTC
19 points
0 comments, 7 min read, LW link