On “first critical tries” in AI alignment

Joe Carlsmith · 5 Jun 2024 0:19 UTC
54 points
5 comments · 14 min read · LW link

There are no coherence theorems

20 Feb 2023 21:25 UTC
128 points
123 comments · 19 min read · LW link

Beyond Kolmogorov and Shannon

25 Oct 2022 15:13 UTC
62 points
19 comments · 5 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
47 points
4 comments · 2 min read · LW link
(arxiv.org)

A framework for thinking about AI power-seeking

Joe Carlsmith · 24 Jul 2024 22:41 UTC
58 points
7 comments · 16 min read · LW link

A simple case for extreme inner misalignment

Richard_Ngo · 13 Jul 2024 15:40 UTC
79 points
39 comments · 7 min read · LW link

Does robustness improve with scale?

25 Jul 2024 20:55 UTC
16 points
0 comments · 1 min read · LW link
(far.ai)

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
44 points
8 comments · 1 min read · LW link
(storage.googleapis.com)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments · 16 min read · LW link

AI Constitutions are a tool to reduce societal scale risk

Sammy Martin · 25 Jul 2024 11:18 UTC
25 points
0 comments · 18 min read · LW link

Value systematization: how values become coherent (and misaligned)

Richard_Ngo · 27 Oct 2023 19:06 UTC
100 points
48 comments · 13 min read · LW link

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson · 16 Jul 2024 22:44 UTC
42 points
19 comments · 5 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
113 points
17 comments · 18 min read · LW link

Coalitional agency

Richard_Ngo · 22 Jul 2024 0:09 UTC
56 points
4 comments · 6 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
34 points
12 comments · 12 min read · LW link

Koan: divining alien datastructures from RAM activations

TsviBT · 5 Apr 2024 18:04 UTC
42 points
10 comments · 21 min read · LW link

Timaeus is hiring!

12 Jul 2024 23:42 UTC
67 points
4 comments · 2 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt · 6 Feb 2024 16:29 UTC
66 points
21 comments · 14 min read · LW link

Inner Misalignment in “Simulator” LLMs

Adam Scherlis · 31 Jan 2023 8:33 UTC
84 points
12 comments · 4 min read · LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

18 Jul 2023 16:36 UTC
110 points
14 comments · 6 min read · LW link