On “first critical tries” in AI alignment

Joe Carlsmith · 5 Jun 2024 0:19 UTC
54 points
5 comments · 14 min read · LW link

There are no coherence theorems

20 Feb 2023 21:25 UTC
128 points
123 comments · 19 min read · LW link

Beyond Kolmogorov and Shannon

25 Oct 2022 15:13 UTC
62 points
19 comments · 5 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
47 points
4 comments · 2 min read · LW link
(arxiv.org)

A framework for thinking about AI power-seeking

Joe Carlsmith · 24 Jul 2024 22:41 UTC
58 points
7 comments · 16 min read · LW link

A simple case for extreme inner misalignment

Richard_Ngo · 13 Jul 2024 15:40 UTC
79 points
39 comments · 7 min read · LW link

Does robustness improve with scale?

25 Jul 2024 20:55 UTC
16 points
0 comments · 1 min read · LW link
(far.ai)

JumpReLU SAEs + Early Access to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
44 points
8 comments · 1 min read · LW link
(storage.googleapis.com)

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

19 Jul 2024 20:32 UTC
59 points
6 comments · 16 min read · LW link

AI Constitutions are a tool to reduce societal scale risk

Sammy Martin · 25 Jul 2024 11:18 UTC
25 points
0 comments · 18 min read · LW link

Value systematization: how values become coherent (and misaligned)

Richard_Ngo · 27 Oct 2023 19:06 UTC
100 points
48 comments · 13 min read · LW link

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson · 16 Jul 2024 22:44 UTC
42 points
19 comments · 5 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

18 Jul 2024 14:15 UTC
113 points
17 comments · 18 min read · LW link

Coalitional agency

Richard_Ngo · 22 Jul 2024 0:09 UTC
56 points
4 comments · 6 min read · LW link

Stitching SAEs of different sizes

13 Jul 2024 17:19 UTC
34 points
12 comments · 12 min read · LW link

Koan: divining alien datastructures from RAM activations

TsviBT · 5 Apr 2024 18:04 UTC
42 points
10 comments · 21 min read · LW link

Timaeus is hiring!

12 Jul 2024 23:42 UTC
67 points
4 comments · 2 min read · LW link

Preventing model exfiltration with upload limits

ryan_greenblatt · 6 Feb 2024 16:29 UTC
66 points
21 comments · 14 min read · LW link

Inner Misalignment in “Simulator” LLMs

Adam Scherlis · 31 Jan 2023 8:33 UTC
84 points
12 comments · 4 min read · LW link

Measuring and Improving the Faithfulness of Model-Generated Reasoning

18 Jul 2023 16:36 UTC
110 points
14 comments · 6 min read · LW link