How to train your own “Sleeper Agents”

evhub 7 Feb 2024 0:31 UTC
91 points
8 comments · 2 min read · LW link

An Introduction to AI Sandbagging

26 Apr 2024 13:40 UTC
41 points
4 comments · 8 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors 29 Apr 2024 10:53 UTC
62 points
4 comments · 2 min read · LW link

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller 21 Jul 2020 20:06 UTC
82 points
41 comments · 3 min read · LW link

Fixing The Good Regulator Theorem

johnswentworth 9 Feb 2021 20:30 UTC
136 points
38 comments · 8 min read · LW link · 1 review

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai 16 Apr 2024 21:16 UTC
371 points
86 comments · 12 min read · LW link

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
666 points
205 comments · 12 min read · LW link

LLMs Sometimes Generate Purely Negatively-Reinforced Text

Fabien Roger 16 Jun 2023 16:31 UTC
176 points
11 comments · 7 min read · LW link

When do “brains beat brawn” in Chess? An experiment

titotal 28 Jun 2023 13:33 UTC
294 points
80 comments · 7 min read · LW link
(titotal.substack.com)

Some background for reasoning about dual-use alignment research

Charlie Steiner 18 May 2023 14:50 UTC
125 points
21 comments · 9 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
187 points
79 comments · 10 min read · LW link

Linear infra-Bayesian Bandits

Vanessa Kosoy 10 May 2024 6:41 UTC
29 points
2 comments · 1 min read · LW link
(arxiv.org)

Mechanistic Interpretability Workshop Happening at ICML 2024!

3 May 2024 1:18 UTC
48 points
6 comments · 1 min read · LW link

Interpreting the Learning of Deceit

RogerDearnaley 18 Dec 2023 8:12 UTC
30 points
10 comments · 9 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
162 points
37 comments · 45 min read · LW link

AI Safety Strategies Landscape

Charbel-Raphaël 9 May 2024 17:33 UTC
25 points
0 comments · 42 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
106 points
6 comments · 22 min read · LW link

Summing up “Scheming AIs” (Section 5)

Joe Carlsmith 9 Dec 2023 15:48 UTC
2 points
1 comment · 11 min read · LW link

Visualizing neural network planning

9 May 2024 6:40 UTC
4 points
0 comments · 5 min read · LW link

CLR’s recent work on multi-agent systems

JesseClifton 9 Mar 2021 2:28 UTC
54 points
2 comments · 13 min read · LW link