
Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
49 points
4 comments · 2 min read · LW link
(arxiv.org)

A framework for thinking about AI power-seeking

Joe Carlsmith · 24 Jul 2024 22:41 UTC
58 points
7 comments · 16 min read · LW link

AI Constitutions are a tool to reduce societal scale risk

Sammy Martin · 25 Jul 2024 11:18 UTC
25 points
0 comments · 18 min read · LW link

Does robustness improve with scale?

25 Jul 2024 20:55 UTC
16 points
0 comments · 1 min read · LW link
(far.ai)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
113 points
17 comments · 18 min read · LW link

Coalitional agency

Richard_Ngo · 22 Jul 2024 0:09 UTC
56 points
4 comments · 6 min read · LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments · 16 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

20 Jul 2024 2:20 UTC
46 points
0 comments · 4 min read · LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
44 points
8 comments · 1 min read · LW link
(storage.googleapis.com)

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Andrew_Critch · 14 Jun 2024 0:16 UTC
324 points
34 comments · 4 min read · LW link

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Neel Nanda · 7 Jul 2024 17:39 UTC
129 points
15 comments · 24 min read · LW link

A simple case for extreme inner misalignment

Richard_Ngo · 13 Jul 2024 15:40 UTC
79 points
39 comments · 7 min read · LW link

A more systematic case for inner misalignment

Richard_Ngo · 20 Jul 2024 5:03 UTC
30 points
4 comments · 5 min read · LW link

SAE feature geometry is outside the superposition hypothesis

jake_mendel · 24 Jun 2024 16:07 UTC
216 points
17 comments · 11 min read · LW link

LLM Generality is a Timeline Crux

eggsyntax · 24 Jun 2024 12:52 UTC
201 points
92 comments · 7 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

18 Jul 2024 10:29 UTC
40 points
0 comments · 10 min read · LW link

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger · 19 Jul 2024 14:07 UTC
29 points
1 comment · 2 min read · LW link
(arxiv.org)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson · 16 Jul 2024 22:44 UTC
42 points
19 comments · 5 min read · LW link

Timaeus is hiring!

12 Jul 2024 23:42 UTC
67 points
4 comments · 2 min read · LW link

Formal verification, heuristic explanations and surprise accounting

Jacob_Hilton · 25 Jun 2024 15:40 UTC
147 points
11 comments · 9 min read · LW link
(www.alignment.org)