the case for CoT unfaithfulness is overstated

nostalgebraist, 29 Sep 2024 22:07 UTC

272 points

45 comments, 11 min read, 1 review

0.836 Bits of Evidence In Favor of Futarchy

niplav and Claude+

29 Sep 2024 21:57 UTC

39 points

0 comments, 3 min read

Pomodoro Method Randomized Self Experiment

niplav, 29 Sep 2024 21:55 UTC

16 points

2 comments, 1 min read

Toy Models of Superposition: Simplified by Hand

Axel Sorensen, 29 Sep 2024 21:19 UTC

9 points

3 comments, 8 min read

LLMs are likely not conscious

research_prime_space, 29 Sep 2024 20:57 UTC

6 points

9 comments, 1 min read

A Policy Proposal

phdead, 29 Sep 2024 20:45 UTC

10 points

4 comments, 4 min read

Do Sparse Autoencoders (SAEs) transfer across base and finetuned language models?

Taras Kutsyk, Tommaso Mencattini and Ciprian Florea

29 Sep 2024 19:37 UTC

28 points

8 comments, 25 min read

Models of life

Abhishaike Mahajan, 29 Sep 2024 19:24 UTC

8 points

0 comments, 16 min read

(www.asimov.press)

Interpreting the effects of Jailbreak Prompts in LLMs

Harsh Raj, 29 Sep 2024 19:01 UTC

9 points

0 comments, 5 min read

New Capabilities, New Risks? - Evaluating Agentic General Assistants using Elements of GAIA & METR Frameworks

Tej Lander, 29 Sep 2024 18:58 UTC

5 points

0 comments, 29 min read

Developmental Stages in Multi-Problem Grokking

James Sullivan, 29 Sep 2024 18:58 UTC

5 points

0 comments, 6 min read

A Psychoanalytic Explanation of Sam Altman’s Irrational Actions

Gabe, 29 Sep 2024 18:58 UTC

1 point

3 comments, 3 min read

Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation

Antonio Clarke, 29 Sep 2024 18:48 UTC

6 points

0 comments, 23 min read

Cryonics is free

Mati_Roy, 29 Sep 2024 17:58 UTC

216 points

48 comments, 2 min read

Runner’s High On Demand: A Story of Luck & Persistence

Shoshannah Tekofsky, 29 Sep 2024 17:15 UTC

14 points

6 comments, 5 min read

(shoshanigans.substack.com)

You can, in fact, bamboozle an unaligned AI into sparing your life

David Matolcsi, 29 Sep 2024 16:59 UTC

127 points

175 comments, 27 min read

Base LLMs refuse too

Connor Kissane, robertzk, Arthur Conmy and Neel Nanda

29 Sep 2024 16:04 UTC

61 points

20 comments, 10 min read

My Methodological Turn

adamShimi, 29 Sep 2024 15:01 UTC

29 points

0 comments, 1 min read

(formethods.substack.com)

Linkpost: Hypocrisy standoff

Chris_Leong, 29 Sep 2024 14:27 UTC

5 points

1 comment, 1 min read

(x.com)

[Question] Any real toeholds for making practical decisions regarding AI safety?

lemonhope, 29 Sep 2024 12:03 UTC

27 points

6 comments, 1 min read

Review: Dr Stone

ProgramCrafter, 29 Sep 2024 10:35 UTC

18 points

9 comments, 4 min read

AXRP Episode 36 - Adam Shai and Paul Riechers on Computational Mechanics

DanielFilan, 29 Sep 2024 5:50 UTC

26 points

0 comments, 55 min read

DunCon @Lighthaven

Duncan Sabien (Inactive), 29 Sep 2024 4:56 UTC

46 points

2 comments, 1 min read

Exploring Shard-like Behavior: Empirical Insights into Contextual Decision-Making in RL Agents

Alejandro Aristizabal, 29 Sep 2024 0:32 UTC

6 points

0 comments, 15 min read

Jailbreaking language models with user roleplay

loops, 28 Sep 2024 23:43 UTC

9 points

0 comments, 3 min read

(iter.ca)

“Slow” takeoff is a terrible term for “maybe even faster takeoff, actually”

Raemon, 28 Sep 2024 23:38 UTC

223 points

70 comments, 1 min read, 1 review

Contextual Constitutional AI

aksh-n, 28 Sep 2024 23:24 UTC

16 points

2 comments, 12 min read

Explore More: A Bag of Tricks to Keep Your Life on the Rails

Shoshannah Tekofsky, 28 Sep 2024 21:38 UTC

248 points

20 comments, 11 min read, 1 review

(shoshanigans.substack.com)

2024 Petrov Day Retrospective

Ben Pace and Raemon

28 Sep 2024 21:30 UTC

95 points

25 comments, 10 min read

[Question] Any Trump Supporters Want to Dialogue?

k64, 28 Sep 2024 19:41 UTC

15 points

92 comments, 1 min read

Evaluating LLaMA 3 for political sycophancy

alma.liezenga, 28 Sep 2024 19:02 UTC

2 points

2 comments, 6 min read

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga, 28 Sep 2024 18:29 UTC

9 points

0 comments, 9 min read

COT Scaling implies slower takeoff speeds

Logan Zoellner, 28 Sep 2024 16:20 UTC

36 points

56 comments, 1 min read

Thoughts on Evo-Bio Math and Mesa-Optimization: Maybe We Need To Think Harder About “Relative” Fitness?

Lorec, 28 Sep 2024 14:07 UTC

6 points

6 comments, 1 min read

Steering LLMs’ Behavior with Concept Activation Vectors

Ruixuan Huang, 28 Sep 2024 9:53 UTC

9 points

0 comments, 10 min read

An Interactive Shapley Value Explainer

James Stephen Brown, 28 Sep 2024 5:01 UTC

42 points

9 comments, 1 min read

(nonzerosum.games)

[Question] Implications of China’s recession on AGI development?

Eric Neyman, 28 Sep 2024 1:12 UTC

41 points

4 comments, 1 min read

The Compute Conundrum: AI Governance in a Shifting Geopolitical Era

octavo, 28 Sep 2024 1:05 UTC

−3 points

1 comment, 17 min read

‘Chat with impactful research & evaluations’ (Unjournal NotebookLMs)

david reinstein, 28 Sep 2024 0:32 UTC

6 points

0 comments, 2 min read

Where is the Learn Everything System?

Shoshannah Tekofsky, 27 Sep 2024 21:30 UTC

16 points

8 comments, 4 min read

(thinkfeelplay.substack.com)

An “Observatory” For a Shy Super AI?

Sherrinford, 27 Sep 2024 21:22 UTC

5 points

0 comments, 1 min read

(robreid.substack.com)

[Question] Searching for Impossibility Results or No-Go Theorems for provable safety.

Maelstrom, 27 Sep 2024 20:12 UTC

2 points

1 comment, 1 min read

What is Randomness?

martinkunev, 27 Sep 2024 17:49 UTC

11 points

2 comments, 10 min read

The Geometry of Feelings and Nonsense in Large Language Models

7vik and Nandi

27 Sep 2024 17:49 UTC

62 points

10 comments, 4 min read

Avoiding jailbreaks by discouraging their representation in activation space

Guido Bergman, 27 Sep 2024 17:49 UTC

8 points

2 comments, 9 min read

[Question] Why is o1 so deceptive?

abramdemski, 27 Sep 2024 17:27 UTC

185 points

24 comments, 3 min read

The Offense-Defense Balance of Gene Drives

Maxwell Tabarrok, 27 Sep 2024 16:47 UTC

23 points

1 comment, 4 min read

(www.maximum-progress.com)

Book Review: On the Edge: The Future

Zvi, 27 Sep 2024 14:00 UTC

61 points

1 comment, 49 min read

(thezvi.wordpress.com)

[Question] Is cybercrime really costing trillions per year?

Fabien Roger, 27 Sep 2024 8:44 UTC

66 points

28 comments, 1 min read

Australian AI Safety Forum 2024

Liam Carroll and Daniel Murfet

27 Sep 2024 0:40 UTC

42 points

0 comments, 2 min read