
Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
49 points
4 comments · 2 min read · LW link
(arxiv.org)

A framework for thinking about AI power-seeking

Joe Carlsmith · 24 Jul 2024 22:41 UTC
58 points
7 comments · 16 min read · LW link

AI Constitutions are a tool to reduce societal scale risk

Sammy Martin · 25 Jul 2024 11:18 UTC
25 points
0 comments · 18 min read · LW link

Does robustness improve with scale?

25 Jul 2024 20:55 UTC
16 points
0 comments · 1 min read · LW link
(far.ai)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
113 points
17 comments · 18 min read · LW link

Coalitional agency

Richard_Ngo · 22 Jul 2024 0:09 UTC
56 points
4 comments · 6 min read · LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments · 16 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
46 points
0 comments · 4 min read · LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
44 points
8 comments · 1 min read · LW link
(storage.googleapis.com)

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_Critch · 14 Jun 2024 0:16 UTC
324 points
34 comments · 4 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · 7 Jul 2024 17:39 UTC
129 points
15 comments · 24 min read · LW link

A simple case for extreme inner misalignment

Richard_Ngo · 13 Jul 2024 15:40 UTC
79 points
39 comments · 7 min read · LW link

A more systematic case for inner misalignment

Richard_Ngo · 20 Jul 2024 5:03 UTC
30 points
4 comments · 5 min read · LW link

SAE feature geometry is outside the superposition hypothesis

jake_mendel · 24 Jun 2024 16:07 UTC
216 points
17 comments · 11 min read · LW link

LLM Generality is a Timeline Crux

eggsyntax · 24 Jun 2024 12:52 UTC
201 points
92 comments · 7 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
40 points
0 comments · 10 min read · LW link

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger · 19 Jul 2024 14:07 UTC
29 points
1 comment · 2 min read · LW link
(arxiv.org)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson · 16 Jul 2024 22:44 UTC
42 points
19 comments · 5 min read · LW link

Timaeus is hiring!

12 Jul 2024 23:42 UTC
67 points
4 comments · 2 min read · LW link

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton · 25 Jun 2024 15:40 UTC
147 points
11 comments · 9 min read · LW link
(www.alignment.org)