Secondary Risk Markets

Vaniver · 11 Dec 2023 21:52 UTC
35 points
4 comments · 4 min read · LW link

Has anyone experimented with Dodrio, a tool for exploring transformer models through interactive visualization?

Bill Benzon · 11 Dec 2023 20:34 UTC
4 points
0 comments · 1 min read · LW link

[Valence series] 3. Valence & Beliefs

Steven Byrnes · 11 Dec 2023 20:21 UTC
63 points
6 comments · 21 min read · LW link

[Question] Am I ethically obligated to extend the life of my dog with life-extension treatments about to hit the market?

TrudosKudos · 11 Dec 2023 19:41 UTC
−3 points
1 comment · 1 min read · LW link

Adversarial Robustness Could Help Prevent Catastrophic Misuse

aogara · 11 Dec 2023 19:12 UTC
30 points
18 comments · 9 min read · LW link

The Consciousness Box

GradualImprovement · 11 Dec 2023 16:45 UTC
33 points
22 comments · 4 min read · LW link

Empirical work that might shed light on scheming (Section 6 of “Scheming AIs”)

Joe Carlsmith · 11 Dec 2023 16:30 UTC
8 points
0 comments · 21 min read · LW link

Into AI Safety: Episode 3

jacobhaimes · 11 Dec 2023 16:30 UTC
6 points
0 comments · 1 min read · LW link
(into-ai-safety.github.io)

Implicitly Typed C

jefftk · 11 Dec 2023 16:10 UTC
16 points
0 comments · 1 min read · LW link
(www.jefftk.com)

37C3 Hacker x Rationalist Meetup

11 Dec 2023 16:02 UTC
5 points
5 comments · 1 min read · LW link

re: Yudkowsky on biological materials

bhauth · 11 Dec 2023 13:28 UTC
179 points
30 comments · 5 min read · LW link

Ideoculture

elv · 11 Dec 2023 10:29 UTC
8 points
2 comments · 6 min read · LW link

Quick thoughts on the implications of multi-agent views of mind on AI takeover

Kaj_Sotala · 11 Dec 2023 6:34 UTC
40 points
14 comments · 4 min read · LW link

Auditing failures vs concentrated failures

11 Dec 2023 2:47 UTC
44 points
0 comments · 7 min read · LW link

Facing Up to the Problem of Consciousness

Bruce W. Lee · 10 Dec 2023 23:31 UTC
8 points
0 comments · 3 min read · LW link

Deeply Cover Car Crashes?

jefftk · 10 Dec 2023 22:20 UTC
36 points
31 comments · 1 min read · LW link
(www.jefftk.com)

Principles For Product Liability (With Application To AI)

johnswentworth · 10 Dec 2023 21:27 UTC
37 points
55 comments · 10 min read · LW link

[Question] What do you do to remember and reference the LessWrong posts that were most personally significant to you, in terms of intellectual development or general usefulness?

lillybaeum · 10 Dec 2023 17:52 UTC
5 points
7 comments · 1 min read · LW link

[Question] Do websites and apps actually generally get worse after updates, or is it just an effect of the fear of change?

lillybaeum · 10 Dec 2023 17:26 UTC
33 points
34 comments · 2 min read · LW link

How LDT helps reduce the AI arms race

Tamsin Leake · 10 Dec 2023 16:21 UTC
70 points
13 comments · 4 min read · LW link
(carado.moe)

Understanding Subjective Probabilities

Isaac King · 10 Dec 2023 6:03 UTC
30 points
16 comments · 10 min read · LW link

Send us example gnarly bugs

10 Dec 2023 5:23 UTC
77 points
10 comments · 2 min read · LW link

Conceptual coherence for concrete categories in humans and LLMs

Bill Benzon · 9 Dec 2023 23:49 UTC
13 points
1 comment · 2 min read · LW link

2d ai-partners as a comprehensive motivation tool

AiresJL · 9 Dec 2023 21:59 UTC
3 points
0 comments · 1 min read · LW link

Without—MicroFiction 250 words

Carissa Cassiel · 9 Dec 2023 21:49 UTC
19 points
1 comment · 1 min read · LW link

Some negative steganography results

Fabien Roger · 9 Dec 2023 20:22 UTC
55 points
5 comments · 2 min read · LW link

Summing up “Scheming AIs” (Section 5)

Joe Carlsmith · 9 Dec 2023 15:48 UTC
2 points
0 comments · 11 min read · LW link

The Offense-Defense Balance Rarely Changes

Maxwell Tabarrok · 9 Dec 2023 15:21 UTC
75 points
23 comments · 3 min read · LW link
(maximumprogress.substack.com)

A Philosophical Tautology

Nox ML · 9 Dec 2023 14:06 UTC
−2 points
45 comments · 2 min read · LW link

Unpicking Extinction

ukc10014 · 9 Dec 2023 9:15 UTC
34 points
10 comments · 10 min read · LW link

Finding Sparse Linear Connections between Features in LLMs

9 Dec 2023 2:27 UTC
68 points
5 comments · 10 min read · LW link

[Question] Option Space Nomenclature

SilverFlame · 8 Dec 2023 23:14 UTC
1 point
0 comments · 1 min read · LW link

“Model UN Solutions”

Arjun Panickssery · 8 Dec 2023 23:06 UTC
36 points
5 comments · 1 min read · LW link
(open.substack.com)

Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”)

Joe Carlsmith · 8 Dec 2023 21:09 UTC
9 points
0 comments · 15 min read · LW link

Foreacting agents

B Jacobs · 8 Dec 2023 19:57 UTC
4 points
0 comments · 13 min read · LW link

Modeling incentives at scale using LLMs

8 Dec 2023 18:46 UTC
7 points
3 comments · 13 min read · LW link

Refusal mechanisms: initial experiments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
79 points
7 comments · 7 min read · LW link

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

Karolis Ramanauskas · 8 Dec 2023 13:18 UTC
13 points
1 comment · 4 min read · LW link
(arxiv.org)

What I Would Do If I Were Working On AI Governance

johnswentworth · 8 Dec 2023 6:43 UTC
109 points
32 comments · 10 min read · LW link

Whither Prison Abolition?

MadHatter · 8 Dec 2023 5:27 UTC
−7 points
0 comments · 16 min read · LW link
(bittertruths.substack.com)

Class consciousness for those against the class system

TekhneMakre · 8 Dec 2023 1:02 UTC
10 points
7 comments · 1 min read · LW link

Building selfless agents to avoid instrumental self-preservation.

blallo · 7 Dec 2023 18:59 UTC
14 points
2 comments · 6 min read · LW link

Does Chat-GPT display ‘Scope Insensitivity’?

callum · 7 Dec 2023 18:58 UTC
11 points
0 comments · 3 min read · LW link

LLM keys—A Proposal of a Solution to Prompt Injection Attacks

Peter Hroššo · 7 Dec 2023 17:36 UTC
1 point
2 comments · 1 min read · LW link

Meetup Tip: Heartbeat Messages

Screwtape · 7 Dec 2023 17:18 UTC
68 points
4 comments · 3 min read · LW link

[Valence series] 2. Valence & Normativity

Steven Byrnes · 7 Dec 2023 16:43 UTC
70 points
4 comments · 28 min read · LW link

AISN #27: Defensive Accelerationism, A Retrospective On The OpenAI Board Saga, And A New AI Bill From Senators Thune And Klobuchar

7 Dec 2023 15:59 UTC
13 points
0 comments · 6 min read · LW link
(newsletter.safe.ai)

AI #41: Bring in the Other Gemini

Zvi · 7 Dec 2023 15:10 UTC
46 points
16 comments · 52 min read · LW link
(thezvi.wordpress.com)

Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”)

Joe Carlsmith · 7 Dec 2023 15:05 UTC
10 points
1 comment · 19 min read · LW link

Results from the Turing Seminar hackathon

7 Dec 2023 14:50 UTC
29 points
1 comment · 6 min read · LW link