[Question] Shane Legg’s necessary properties for every AGI Safety plan

jacquesthibs

1 May 2024 17:15 UTC

59 points

8 comments · 1 min read · LW link

Introducing AI Lab Watch

Zach Stein-Perlman

30 Apr 2024 17:00 UTC

156 points

7 comments · 1 min read · LW link

(ailabwatch.org)

Mechanistically Eliciting Latent Behaviors in Language Models

Andrew Mack and TurnTrout

30 Apr 2024 18:51 UTC

128 points

19 comments · 45 min read · LW link

ACX Covid Origins Post convinced readers

ErnestScribbler

1 May 2024 13:06 UTC

50 points

4 comments · 2 min read · LW link

LessWrong Community Weekend 2024, open for applications

UnplannedCauliflower and jt

1 May 2024 10:18 UTC

56 points

0 comments · 7 min read · LW link

Manifund Q1 Retro: Learnings from impact certs

Austin Chen

1 May 2024 16:48 UTC

35 points

0 comments · 1 min read · LW link

Why I’m doing PauseAI

Joseph Miller

30 Apr 2024 16:21 UTC

84 points

6 comments · 4 min read · LW link

Ironing Out the Squiggles

Zack_M_Davis

29 Apr 2024 16:13 UTC

137 points

26 comments · 11 min read · LW link

Questions for labs

Zach Stein-Perlman

30 Apr 2024 22:15 UTC

54 points

5 comments · 8 min read · LW link

[Linkpost] Silver Bulletin: For most people, politics is about fitting in

Gunnar_Zarncke

1 May 2024 18:12 UTC

17 points

0 comments · 1 min read · LW link

(www.natesilver.net)

Take SCIFs, it’s dangerous to go alone

latterframe, Jeffrey Ladish and schroederdewitt

1 May 2024 8:02 UTC

32 points

1 comment · 3 min read · LW link

Transcoders enable fine-grained interpretable circuit analysis for language models

Jacob Dunefsky, Philippe Chlenski and Neel Nanda

30 Apr 2024 17:58 UTC

46 points

10 comments · 17 min read · LW link

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

151 points

59 comments · 10 min read · LW link

AXRP Episode 30 - AI Security with Jeffrey Ladish

DanielFilan

1 May 2024 2:50 UTC

25 points

0 comments · 79 min read · LW link

KAN: Kolmogorov-Arnold Networks

Gunnar_Zarncke

1 May 2024 16:50 UTC

9 points

9 comments · 1 min read · LW link

(arxiv.org)

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry

29 Apr 2024 20:57 UTC

59 points

6 comments · 11 min read · LW link

The Intentional Stance, LLMs Edition

Eleni Angelou

30 Apr 2024 17:12 UTC

30 points

2 comments · 8 min read · LW link

On Not Pulling The Ladder Up Behind You

Screwtape

26 Apr 2024 21:58 UTC

123 points

11 comments · 9 min read · LW link

The formal goal is a pointer

Pi Rogers

1 May 2024 0:27 UTC

19 points

9 comments · 1 min read · LW link

Towards a formalization of the agent structure problem

Alex_Altair

29 Apr 2024 20:28 UTC

46 points

2 comments · 14 min read · LW link