Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

127 points

40 comments, 10 min read, LW link

Referential Containment

Robert Kralisch

29 Apr 2024 0:16 UTC

2 points

1 comment, 3 min read, LW link

Dequantifying first-order theories

jessicata

23 Apr 2024 19:04 UTC

39 points

9 comments, 8 min read, LW link

(unstableontology.com)

[Concept Dependency] Concept Dependency Posts

Johannes C. Mayer

26 Apr 2024 20:57 UTC

10 points

3 comments, 2 min read, LW link

So What’s Up With PUFAs Chemically?

J Bostock

27 Apr 2024 13:32 UTC

51 points

22 comments, 6 min read, LW link

Big-endian is better than little-endian

Menotim

29 Apr 2024 2:30 UTC

9 points

0 comments, 3 min read, LW link

This is Water by David Foster Wallace

Nathan Young

24 Apr 2024 21:21 UTC

54 points

16 comments, 13 min read, LW link

(fs.blog)

Book review: Deep Utopia

PeterMcCluskey

23 Apr 2024 19:55 UTC

38 points

10 comments, 4 min read, LW link

(bayesianinvestor.com)

[Question] Examples of Highly Counterfactual Discoveries?

johnswentworth

23 Apr 2024 22:19 UTC

167 points

86 comments, 1 min read, LW link

The Prop-room and Stage Cognitive Architecture

Robert Kralisch

29 Apr 2024 0:48 UTC

0 points

0 comments, 14 min read, LW link

How are Simulators and Agents related?

Robert Kralisch

29 Apr 2024 0:22 UTC

4 points

0 comments, 7 min read, LW link

Extended Embodiment

Robert Kralisch

29 Apr 2024 0:18 UTC

4 points

0 comments, 3 min read, LW link

Disentangling Competence and Intelligence

Robert Kralisch

29 Apr 2024 0:12 UTC

2 points

0 comments, 6 min read, LW link

Constructability: Plainly-coded AGIs may be feasible in the near future

Épiphanie Gédéon and Charbel-Raphaël

27 Apr 2024 16:04 UTC

58 points

9 comments, 13 min read, LW link

Losing Faith In Contrarianism

omnizoid

25 Apr 2024 20:53 UTC

31 points

39 comments, 5 min read, LW link

An Unintentional Compliment

abstractapplic and lsusr

28 Apr 2024 20:04 UTC

21 points

1 comment, 4 min read, LW link

[Aspiration-based designs] 1. Informal introduction

B Jacobs, Jobst Heitzig, Simon Fischer and Simon Dima

28 Apr 2024 13:00 UTC

30 points

2 comments, 8 min read, LW link

D&D.Sci

abstractapplic

5 Dec 2020 23:26 UTC

65 points

50 comments, 1 min read, LW link

List your AI X-Risk cruxes!

Aryeh Englander

28 Apr 2024 18:26 UTC

27 points

3 comments, 2 min read, LW link

Unintentionally Creating Value

abstractapplic and lsusr

28 Apr 2024 20:05 UTC

19 points

0 comments, 2 min read, LW link