Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
49 points
4 comments
2 min read
LW link
(arxiv.org)

Does robustness improve with scale?

25 Jul 2024 20:55 UTC
16 points
0 comments
1 min read
LW link
(far.ai)

AI Constitutions are a tool to reduce societal scale risk

Sammy Martin
25 Jul 2024 11:18 UTC
25 points
0 comments
18 min read
LW link

A framework for thinking about AI power-seeking

Joe Carlsmith
24 Jul 2024 22:41 UTC
58 points
7 comments
16 min read
LW link

Coalitional agency

Richard_Ngo
22 Jul 2024 0:09 UTC
56 points
4 comments
6 min read
LW link

aimless ace analyzes active amateur: a micro-aaaaalignment proposal

lukehmiles
21 Jul 2024 12:37 UTC
10 points
0 comments
1 min read
LW link

A more systematic case for inner misalignment

Richard_Ngo
20 Jul 2024 5:03 UTC
30 points
4 comments
5 min read
LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
46 points
0 comments
4 min read
LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments
16 min read
LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
44 points
8 comments
1 min read
LW link
(storage.googleapis.com)

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger
19 Jul 2024 14:07 UTC
29 points
1 comment
2 min read
LW link
(arxiv.org)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
113 points
17 comments
18 min read
LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
40 points
0 comments
10 min read
LW link

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson
16 Jul 2024 22:44 UTC
42 points
19 comments
5 min read
LW link

An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs

Jan Wehner
14 Jul 2024 10:37 UTC
23 points
4 comments
17 min read
LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
34 points
12 comments
12 min read
LW link

A simple case for extreme inner misalignment

Richard_Ngo
13 Jul 2024 15:40 UTC
79 points
39 comments
7 min read
LW link

Timaeus is hiring!

12 Jul 2024 23:42 UTC
67 points
4 comments
2 min read
LW link

Games for AI Control

11 Jul 2024 18:40 UTC
28 points
0 comments
4 min read
LW link

UC Berkeley course on LLMs and ML Safety

Dan H
9 Jul 2024 15:40 UTC
36 points
1 comment
1 min read
LW link
(rdi.berkeley.edu)