DPO/PPO-RLHF on LLMs incentivizes sycophancy, exaggeration and deceptive hallucination, but not misaligned powerseeking

tailcalled 10 Jun 2024 21:20 UTC

29 points

13 comments, 2 min read, LW link

Plop! Goes the Concept

Jonathan Moregård 10 Jun 2024 19:23 UTC

6 points

0 comments, 8 min read, LW link

(honestliving.substack.com)

What can we learn from orcas?

Jonasb 10 Jun 2024 18:01 UTC

1 point

0 comments, 8 min read, LW link

(www.denominations.io)

How to build a data center, by Construction Physics

TheManxLoiner 10 Jun 2024 17:38 UTC

2 points

0 comments, 1 min read, LW link

(www.construction-physics.com)

Observations for doing debate with models behind APIs

PoD123 10 Jun 2024 16:22 UTC

3 points

0 comments, 3 min read, LW link

My AI Model Delta Compared To Yudkowsky

johnswentworth 10 Jun 2024 16:12 UTC

279 points

107 comments, 4 min read, LW link

[Question] Good ways to monetarily profit from the increasing demand for power?

Trinley Goldenberg 10 Jun 2024 15:29 UTC

12 points

5 comments, 1 min read, LW link

The Evolution towards the Blank Slate

Arturo Macias 10 Jun 2024 15:20 UTC

−6 points

0 comments, 3 min read, LW link

10 Public “I was wrong” Admissions by Scientists and Intellectuals

Hashem ElAssad 10 Jun 2024 14:19 UTC

0 points

3 comments, 1 min read, LW link

[Valence series] 4. Valence & Liking / Admiring

Steven Byrnes 10 Jun 2024 14:19 UTC

49 points

18 comments, 17 min read, LW link

5. Open Corrigibility Questions

Max Harms 10 Jun 2024 14:09 UTC

32 points

1 comment, 7 min read, LW link

4. Existing Writing on Corrigibility

Max Harms 10 Jun 2024 14:08 UTC

65 points

22 comments, 106 min read, LW link

On Dwarksh’s Podcast with Leopold Aschenbrenner

Zvi 10 Jun 2024 12:40 UTC

102 points

7 comments, 59 min read, LW link

(thezvi.wordpress.com)

Summary of Situational Awareness—The Decade Ahead

Oscar 10 Jun 2024 8:44 UTC

6 points

2 comments, 1 min read, LW link

(forum.effectivealtruism.org)

Why I don’t believe in the placebo effect

transhumanist_atom_understander 10 Jun 2024 2:37 UTC

148 points

23 comments, 9 min read, LW link

Soviet comedy film recommendations

Nina Panickssery 9 Jun 2024 23:40 UTC

42 points

11 comments, 2 min read, LW link

(open.substack.com)

The Data Wall is Important

JustisMills 9 Jun 2024 22:54 UTC

40 points

20 comments, 2 min read, LW link

(justismills.substack.com)

Two Family Dance Flyers

jefftk 9 Jun 2024 20:50 UTC

13 points

0 comments, 1 min read, LW link

(www.jefftk.com)

[Question] What happens to existing life sentences under LEV?

O O 9 Jun 2024 17:49 UTC

5 points

7 comments, 1 min read, LW link

3b. Formal (Faux) Corrigibility

Max Harms 9 Jun 2024 17:18 UTC

26 points

19 comments, 17 min read, LW link

3a. Towards Formal Corrigibility

Max Harms 9 Jun 2024 16:53 UTC

30 points

12 comments, 19 min read, LW link

Introducing SARA: a new activation steering technique

Alejandro Tlaie 9 Jun 2024 15:33 UTC

17 points

7 comments, 6 min read, LW link

“What the hell is a representation, anyway?” | Clarifying AI interpretability with tools from philosophy of cognitive science | Part 1: Vehicles vs. contents

IwanWilliams 9 Jun 2024 14:19 UTC

9 points

1 comment, 4 min read, LW link

Exploring Llama-3-8B MLP Neurons

ntt123 9 Jun 2024 14:19 UTC

10 points

0 comments, 4 min read, LW link

(neuralblog.github.io)

Demystifying “Alignment” through a Comic

milanrosko 9 Jun 2024 8:24 UTC

109 points

19 comments, 1 min read, LW link

Dumbing down

Martin Sustrik 9 Jun 2024 6:50 UTC

74 points

1 comment, 4 min read, LW link

What if a tech company forced you to move to NYC?

KatjaGrace 9 Jun 2024 6:30 UTC

95 points

24 comments, 1 min read, LW link, 2 reviews

(worldspiritsockpuppet.com)

[Question] What should I do? (long term plan about starting an AI lab)

not_a_cat 9 Jun 2024 0:45 UTC

2 points

1 comment, 2 min read, LW link

Searching for the Root of the Tree of Evil

Ivan Vendrov 8 Jun 2024 17:05 UTC

40 points

14 comments, 5 min read, LW link

(nothinghuman.substack.com)

2. Corrigibility Intuition

Max Harms 8 Jun 2024 15:52 UTC

86 points

11 comments, 33 min read, LW link

Two easy things that maybe Just Work to improve AI discourse

Bird Concept 8 Jun 2024 15:51 UTC

192 points

34 comments, 2 min read, LW link

I made an AI safety fellowship. What I wish I knew.

Ruben Castaing 8 Jun 2024 15:23 UTC

12 points

0 comments, 2 min read, LW link

Alignment Gaps

kcyras 8 Jun 2024 15:23 UTC

11 points

4 comments, 8 min read, LW link

The Slack Double Crux, or how to negotiate with yourself

Thac0 8 Jun 2024 15:22 UTC

7 points

2 comments, 4 min read, LW link

The Perils of Popularity: A Critical Examination of LessWrong’s Rational Discourse

BubbaJoeLouis 8 Jun 2024 15:22 UTC

−24 points

3 comments, 2 min read, LW link

Status quo bias is usually justified

Amadeus Pagel 8 Jun 2024 14:54 UTC

10 points

3 comments, 1 min read, LW link

(amadeuspagel.substack.com)

Closed-Source Evaluations

Jono 8 Jun 2024 14:18 UTC

15 points

4 comments, 1 min read, LW link

Access to powerful AI might make computer security radically easier

Buck 8 Jun 2024 6:00 UTC

108 points

14 comments, 6 min read, LW link

[Question] Why don’t we just get rid of all the bioethicists?

Sable 8 Jun 2024 3:48 UTC

12 points

0 comments, 1 min read, LW link

Sev, Sevteen, Sevty, Sevth

jefftk 8 Jun 2024 2:30 UTC

17 points

9 comments, 1 min read, LW link

(www.jefftk.com)

1. The CAST Strategy

Max Harms 7 Jun 2024 22:29 UTC

58 points

27 comments, 38 min read, LW link

0. CAST: Corrigibility as Singular Target

Max Harms 7 Jun 2024 22:29 UTC

163 points

24 comments, 9 min read, LW link, 2 reviews

What is space? What is time?

Tahp 7 Jun 2024 22:15 UTC

8 points

3 comments, 7 min read, LW link

[Question] Question about Lewis’ counterfactual theory of causation

jbkjr 7 Jun 2024 20:15 UTC

12 points

7 comments, 1 min read, LW link

Relationships among words, metalingual definition, and interpretability

Bill Benzon 7 Jun 2024 19:18 UTC

2 points

0 comments, 5 min read, LW link

Let’s Talk About Emergence

jacobhaimes 7 Jun 2024 19:18 UTC

4 points

0 comments, 7 min read, LW link

(www.odysseaninstitute.org)

D&D.Sci Alchemy: Archmage Anachronos and the Supply Chain Issues

aphyer 7 Jun 2024 19:02 UTC

43 points

18 comments, 3 min read, LW link, 2 reviews

Natural Latents Are Not Robust To Tiny Mixtures

johnswentworth and David Lorell 7 Jun 2024 18:53 UTC

65 points

8 comments, 5 min read, LW link

Situational Awareness Summarized—Part 2

Joe Rogero 7 Jun 2024 17:20 UTC

12 points

2 comments, 4 min read, LW link

Frida van Lisa, a short story about adversarial AI attacks on humans

arisAlexis 7 Jun 2024 13:22 UTC

2 points

0 comments, 18 min read, LW link