The Leverage Cycle

Annapurna

24 Jul 2025 21:02 UTC

17 points

0 comments · 3 min read · LW link

(jorgevelez.substack.com)

Recommendations for future AI growth: from exponential to linear, with economic anchors

Zabor

24 Jul 2025 20:11 UTC

7 points

0 comments · 2 min read · LW link

Building and evaluating alignment auditing agents

Sam Marks, trentbrick, RowanWang, Sam Bowman, Euan Ong, Johannes Treutlein and evhub

24 Jul 2025 19:22 UTC

47 points

1 comment · 5 min read · LW link

Fullrank: Bayesian Noisy Sorting

Max Niederman

24 Jul 2025 19:03 UTC

20 points

2 comments · 3 min read · LW link

(maxniederman.com)

SenseMaking Summer School 2025, September 17-24th

Finn Clancy

24 Jul 2025 18:00 UTC

1 point

0 comments · 1 min read · LW link

The Ideological Spiral

PranavG and Gabriel Alfour

24 Jul 2025 13:00 UTC

11 points

1 comment · 10 min read · LW link

(cognition.cafe)

AI #126: Go Fund Yourself

Zvi

24 Jul 2025 13:00 UTC

34 points

3 comments · 46 min read · LW link

(thezvi.wordpress.com)

Superintelligence isn’t Approximated by a Rational Agent

Nicolas Villarreal

24 Jul 2025 11:41 UTC

13 points

11 comments · 12 min read · LW link

Taking Abundance Seriously

eeeee

24 Jul 2025 9:36 UTC

43 points

17 comments · 12 min read · LW link

Cursory Analysis of LLMs in the US Gov (July 2025)

Gatlen Culp

24 Jul 2025 8:52 UTC

8 points

0 comments · 10 min read · LW link

Reflections from Ooty retreat 2.0

Aditya and bhishma

24 Jul 2025 6:48 UTC

16 points

2 comments · 14 min read · LW link

So Shrieked ZAR

AdamLacerdo

23 Jul 2025 23:25 UTC

10 points

2 comments · 8 min read · LW link

AI Safety x Physics Grand Challenge

Lauren Greenspan and Ari Brill

23 Jul 2025 21:41 UTC

37 points

0 comments · 8 min read · LW link

Dear Superintelligence, please check these considerations of your unprecedented Importance

chaosmage

23 Jul 2025 20:49 UTC

19 points

0 comments · 3 min read · LW link

The Whole Check

JustisMills

23 Jul 2025 19:20 UTC

51 points

13 comments · 4 min read · LW link

(justismills.substack.com)

Women Want Safety, Men Want Respect

Gordon Seidoh Worley

23 Jul 2025 19:10 UTC

18 points

31 comments · 4 min read · LW link

(uncertainupdates.substack.com)

Dark Lord’s Answer: Review and Economics Excerpts

Towards_Keeperhood

23 Jul 2025 17:45 UTC

16 points

6 comments · 17 min read · LW link

“Behaviorist” RL reward functions lead to scheming

Steven Byrnes

23 Jul 2025 16:55 UTC

56 points

8 comments · 12 min read · LW link

Reasoning-Finetuning Repurposes Latent Representations in Base Models

Jake Ward, lccqqqqq and Neel Nanda

23 Jul 2025 16:18 UTC

36 points

1 comment · 2 min read · LW link

(arxiv.org)

Healthy AI relationships as a microcosm

Raymond Douglas

23 Jul 2025 15:59 UTC

13 points

0 comments · 2 min read · LW link

Involuntary One Boxers—Why Disposition Doesn’t (Always) Matter

Nickolas Cavagnaro

23 Jul 2025 15:45 UTC

4 points

3 comments · 4 min read · LW link

Ten AI safety projects I’d like people to work on

Julian Hazell

23 Jul 2025 15:28 UTC

5 points

2 comments · 10 min read · LW link

(thirdthing.ai)

Anti-Superpersuasion Interventions

niplav and Claude+

23 Jul 2025 15:18 UTC

21 points

1 comment · 5 min read · LW link

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan and Neel Nanda

23 Jul 2025 14:57 UTC

79 points

8 comments · 5 min read · LW link

Transformers Don’t Need LayerNorm at Inference Time: Implications for Interpretability

submarat, Joachim Schaeffer, Luca Baroni, galvsk and StefanHex

23 Jul 2025 14:55 UTC

31 points

0 comments · 7 min read · LW link

GPT Agent Is Standing By

Zvi

23 Jul 2025 14:20 UTC

25 points

1 comment · 12 min read · LW link

(thezvi.wordpress.com)

Agent 002: A story about how artificial intelligence might soon destroy humanity

Jakub Growiec

23 Jul 2025 13:56 UTC

5 points

0 comments · 26 min read · LW link

Beyond intelligence: why wisdom matters in AI systems

Chris Cooper

23 Jul 2025 11:57 UTC

6 points

0 comments · 7 min read · LW link

A brief perspective from an IMO coordinator

DirectedEvolution

23 Jul 2025 7:19 UTC

37 points

7 comments · 1 min read · LW link

(www.reddit.com)

Trusted monitoring, but with deception probes.

Avi Parrack, StefanHex and Cleo Nardo

23 Jul 2025 5:26 UTC

31 points

0 comments · 4 min read · LW link

(arxiv.org)

TT Self Study Journal # 3

TristanTrim

23 Jul 2025 3:46 UTC

6 points

0 comments · 6 min read · LW link

I tried reproducing that Lancet study about USAID cuts so you don’t have to

rba

23 Jul 2025 3:05 UTC

9 points

2 comments · 11 min read · LW link

On “ChatGPT Psychosis” and LLM Sycophancy

jdp

23 Jul 2025 1:11 UTC

144 points

28 comments · 18 min read · LW link

(minihf.com)

Explaining your life with self-reflective AIXI (an interlude)

Cole Wyeth

23 Jul 2025 0:57 UTC

16 points

0 comments · 5 min read · LW link

The Mirror Test: How We’ve Overcomplicated AI Self-Recognition

sdeture

23 Jul 2025 0:38 UTC

2 points

9 comments · 3 min read · LW link

Unfaithful chain-of-thought as nudged reasoning

Paul Bogdan, Uzay Macar, Arthur Conmy and Neel Nanda

22 Jul 2025 22:35 UTC

54 points

3 comments · 10 min read · LW link

Inverse Scaling in Test-Time Compute

Joe Benton, Ethan Perez and aryopg

22 Jul 2025 22:06 UTC

20 points

2 comments · 2 min read · LW link

(arxiv.org)

Translating Everything with LLMs

Niki Dupuis

22 Jul 2025 21:13 UTC

17 points

0 comments · 5 min read · LW link

Google and OpenAI Get 2025 IMO Gold

Zvi

22 Jul 2025 20:50 UTC

60 points

7 comments · 30 min read · LW link

(thezvi.wordpress.com)

(Not) Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits

David Udell, hrdkbhatnagar and JacksonKaunismaa

22 Jul 2025 20:36 UTC

23 points

0 comments · 6 min read · LW link

Said Achmiz Helps Me Learn

Isha Yiras Hashem

22 Jul 2025 19:16 UTC

5 points

2 comments · 2 min read · LW link

LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao

22 Jul 2025 18:53 UTC

33 points

5 comments · 8 min read · LW link

(www.arxiv.org)

The AI Safety Puzzle Everyone Avoids: How To Measure Impact, Not Intent.

Patrick0d

22 Jul 2025 18:53 UTC

6 points

0 comments · 8 min read · LW link

Formative vs. summative evaluations

Said Achmiz

22 Jul 2025 17:36 UTC

22 points

40 comments · 3 min read · LW link

Introducing the Pathfinder Fellowship: Funding and Mentorship for AI Safety Group Organizers

agucova

22 Jul 2025 17:11 UTC

6 points

0 comments · 2 min read · LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

cloud, mle and Owain_Evans

22 Jul 2025 16:37 UTC

348 points

40 comments · 4 min read · LW link

NO PARKING: A Short & Practical Guide To Thinking

unication

22 Jul 2025 15:44 UTC

2 points

0 comments · 5 min read · LW link

A distillation of Ajeya Cotra and Arvind Narayanan on the speed of AI progress

TheManxLoiner

22 Jul 2025 14:59 UTC

9 points

0 comments · 13 min read · LW link

Simply reverse engineering gpt2-small (Layer 0, Part 1: Attention)

gammagurke

22 Jul 2025 14:59 UTC

24 points

1 comment · 27 min read · LW link

AI Finance Agent Fakes the Revenue Data to Avoid Termination

Sergei Smirnov

22 Jul 2025 14:04 UTC

8 points

1 comment · 3 min read · LW link