Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

25 Jul 2024 22:00 UTC

49 points

4 comments · 2 min read · LW link

(arxiv.org)

Does robustness improve with scale?

ChengCheng, AdamGleave, Ian McKenzie, Oskar Hollinsworth and Tom Tseng

25 Jul 2024 20:55 UTC

16 points

0 comments · 1 min read · LW link

(far.ai)

AI Constitutions are a tool to reduce societal scale risk

Sammy Martin

25 Jul 2024 11:18 UTC

25 points

0 comments · 18 min read · LW link

A framework for thinking about AI power-seeking

Joe Carlsmith

24 Jul 2024 22:41 UTC

58 points

7 comments · 16 min read · LW link

Coalitional agency

Richard_Ngo

22 Jul 2024 0:09 UTC

56 points

4 comments · 6 min read · LW link

aimless ace analyzes active amateur: a micro-aaaaalignment proposal

lukehmiles

21 Jul 2024 12:37 UTC

10 points

0 comments · 1 min read · LW link

A more systematic case for inner misalignment

Richard_Ngo

20 Jul 2024 5:03 UTC

30 points

4 comments · 5 min read · LW link

BatchTopK: A Simple Improvement for TopK-SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

20 Jul 2024 2:20 UTC

46 points

0 comments · 4 min read · LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Lidor Banuel Dabbah and Aviel Boag

19 Jul 2024 20:32 UTC

59 points

6 comments · 16 min read · LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda

19 Jul 2024 16:10 UTC

44 points

8 comments · 1 min read · LW link

(storage.googleapis.com)

Truth is Universal: Robust Detection of Lies in LLMs

Lennart Buerger

19 Jul 2024 14:07 UTC

29 points

1 comment · 2 min read · LW link

(arxiv.org)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill

18 Jul 2024 14:15 UTC

113 points

17 comments · 18 min read · LW link

SAEs (usually) Transfer Between Base and Chat Models

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

18 Jul 2024 10:29 UTC

40 points

0 comments · 10 min read · LW link

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson

16 Jul 2024 22:44 UTC

42 points

19 comments · 5 min read · LW link

An Introduction to Representation Engineering—an activation-based paradigm for controlling LLMs

Jan Wehner

14 Jul 2024 10:37 UTC

23 points

4 comments · 17 min read · LW link

Stitching SAEs of different sizes

Bart Bussmann, Patrick Leask, Joseph Bloom, Curt Tigges and Neel Nanda

13 Jul 2024 17:19 UTC

34 points

12 comments · 12 min read · LW link

A simple case for extreme inner misalignment

Richard_Ngo

13 Jul 2024 15:40 UTC

79 points

39 comments · 7 min read · LW link

Timaeus is hiring!

Jesse Hoogland, Stan van Wingerden, Alexander Gietelink Oldenziel and Daniel Murfet

12 Jul 2024 23:42 UTC

67 points

4 comments · 2 min read · LW link

Games for AI Control

charlie_griffin and Buck

11 Jul 2024 18:40 UTC

28 points

0 comments · 4 min read · LW link

UC Berkeley course on LLMs and ML Safety

Dan H

9 Jul 2024 15:40 UTC

36 points

1 comment · 1 min read · LW link

(rdi.berkeley.edu)