LLMs seem (relatively) safe

JustisMills

25 Apr 2024 22:13 UTC

48 points

18 comments, 7 min read, LW link

(justismills.substack.com)

[Aspiration-based designs] 2. Formal framework, basic algorithm

Jobst Heitzig, Simon Dima and Simon Fischer

28 Apr 2024 13:02 UTC

18 points

2 comments, 16 min read, LW link

Searching for Searching for Search

Rubi J. Hudson

14 Feb 2024 23:51 UTC

21 points

3 comments, 7 min read, LW link

Big-endian is better than little-endian

Menotim

29 Apr 2024 2:30 UTC

27 points

14 comments, 3 min read, LW link

On Not Pulling The Ladder Up Behind You

Screwtape

26 Apr 2024 21:58 UTC

120 points

10 comments, 9 min read, LW link

UDT1.01: The Story So Far (1/10)

Diffractor

27 Mar 2024 23:22 UTC

31 points

5 comments, 13 min read, LW link

Ironing Out the Squiggles

Zack_M_Davis

29 Apr 2024 16:13 UTC

87 points

6 comments, 11 min read, LW link

Thoughts on seed oil

dynomight

20 Apr 2024 12:29 UTC

266 points

94 comments, 17 min read, LW link

(dynomight.net)

Refusal in LLMs is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib111, wesg and Neel Nanda

27 Apr 2024 11:13 UTC

142 points

52 comments, 10 min read, LW link

The Prop-room and Stage Cognitive Architecture

Robert Kralisch

29 Apr 2024 0:48 UTC

8 points

3 comments, 14 min read, LW link

Referential Containment

Robert Kralisch

29 Apr 2024 0:16 UTC

2 points

2 comments, 3 min read, LW link

Disentangling Competence and Intelligence

Robert Kralisch

29 Apr 2024 0:12 UTC

16 points

4 comments, 6 min read, LW link

Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers

hugofry

29 Apr 2024 20:57 UTC

20 points

3 comments, 11 min read, LW link

Estimating the Number of Players from Game Result Percentages

Daniel L

28 Apr 2024 17:42 UTC

1 point

2 comments, 1 min read, LW link

Simple probes can catch sleeper agents

Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub

23 Apr 2024 21:10 UTC

117 points

15 comments, 1 min read, LW link

(www.anthropic.com)

Losing Faith In Contrarianism

omnizoid

25 Apr 2024 20:53 UTC

31 points

40 comments, 5 min read, LW link

Towards a formalization of the agent structure problem

Alex_Altair

29 Apr 2024 20:28 UTC

28 points

0 comments, 14 min read, LW link

[Question] Examples of Highly Counterfactual Discoveries?

johnswentworth

23 Apr 2024 22:19 UTC

172 points

88 comments, 1 min read, LW link

Open Thread Spring 2024

habryka

11 Mar 2024 19:17 UTC

22 points

82 comments, 1 min read, LW link

We are headed into an extreme compute overhang

devrandom

26 Apr 2024 21:38 UTC

38 points

15 comments, 2 min read, LW link