How to train your own “Sleeper Agents”

evhub

7 Feb 2024 0:31 UTC

91 points

8 comments · 2 min read · LW link

An Introduction to AI Sandbagging

Teun van der Weij, Felix Hofstätter and Francis Rhys Ward

26 Apr 2024 13:40 UTC

41 points

4 comments · 8 min read · LW link

AISC9 has ended and there will be an AISC10

Linda Linsefors

29 Apr 2024 10:53 UTC

62 points

4 comments · 2 min read · LW link

Competition: Amplify Rohin’s Prediction on AGI researchers & Safety Concerns

stuhlmueller

21 Jul 2020 20:06 UTC

82 points

41 comments · 3 min read · LW link

Fixing The Good Regulator Theorem

johnswentworth

9 Feb 2021 20:30 UTC

136 points

38 comments · 8 min read · LW link · 1 review

Transformers Represent Belief State Geometry in their Residual Stream

Adam Shai

16 Apr 2024 21:16 UTC

371 points

86 comments · 12 min read · LW link

SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow and mwatkins

5 Feb 2023 22:02 UTC

666 points

205 comments · 12 min read · LW link

LLMs Sometimes Generate Purely Negatively-Reinforced Text

Fabien Roger

16 Jun 2023 16:31 UTC

176 points

11 comments · 7 min read · LW link

When do “brains beat brawn” in Chess? An experiment

titotal

28 Jun 2023 13:33 UTC

294 points

80 comments · 7 min read · LW link

(titotal.substack.com)

Some background for reasoning about dual-use alignment research

Charlie Steiner

18 May 2023 14:50 UTC

125 points

21 comments · 9 min read · LW link

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

187 points

79 comments · 10 min read · LW link

Linear infra-Bayesian Bandits

Vanessa Kosoy

10 May 2024 6:41 UTC

29 points

2 comments · 1 min read · LW link

(arxiv.org)

Mechanistic Interpretability Workshop Happening at ICML 2024!

Neel Nanda, LawrenceC and Fazl

3 May 2024 1:18 UTC

48 points

6 comments · 1 min read · LW link

Interpreting the Learning of Deceit

RogerDearnaley

18 Dec 2023 8:12 UTC

30 points

10 comments · 9 min read · LW link

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

30 Apr 2024 18:51 UTC

162 points

37 comments · 45 min read · LW link

AI Safety Strategies Landscape

Charbel-Raphaël

9 May 2024 17:33 UTC

25 points

0 comments · 42 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda, Senthooran Rajamanoharan, János Kramár and Rohin Shah

23 Dec 2023 2:44 UTC

106 points

6 comments · 22 min read · LW link

Summing up “Scheming AIs” (Section 5)

Joe Carlsmith

9 Dec 2023 15:48 UTC

2 points

1 comment · 11 min read · LW link

Visualizing neural network planning

Nevan Wichers, Victor Tao, Fazl and Riccardo Volpato

9 May 2024 6:40 UTC

4 points

0 comments · 5 min read · LW link

CLR’s recent work on multi-agent systems

JesseClifton

9 Mar 2021 2:28 UTC

54 points

2 comments · 13 min read · LW link