So Shrieked ZAR

AdamLacerdo, 23 Jul 2025 23:25 UTC
10 points
2 comments · 8 min read · LW link

AI Safety x Physics Grand Challenge

23 Jul 2025 21:41 UTC
37 points
0 comments · 8 min read · LW link

Dear Superintelligence, please check these considerations of your unprecedented Importance

chaosmage, 23 Jul 2025 20:49 UTC
17 points
0 comments · 3 min read · LW link

The Whole Check

JustisMills, 23 Jul 2025 19:20 UTC
50 points
13 comments · 4 min read · LW link
(justismills.substack.com)

Women Want Safety, Men Want Respect

Gordon Seidoh Worley, 23 Jul 2025 19:10 UTC
18 points
31 comments · 4 min read · LW link
(uncertainupdates.substack.com)

Dark Lord’s Answer: Review and Economics Excerpts

Towards_Keeperhood, 23 Jul 2025 17:45 UTC
16 points
6 comments · 17 min read · LW link

“Behaviorist” RL reward functions lead to scheming

Steven Byrnes, 23 Jul 2025 16:55 UTC
56 points
5 comments · 12 min read · LW link

Reasoning-Finetuning Repurposes Latent Representations in Base Models

23 Jul 2025 16:18 UTC
35 points
1 comment · 2 min read · LW link
(arxiv.org)

Healthy AI relationships as a microcosm

Raymond Douglas, 23 Jul 2025 15:59 UTC
13 points
0 comments · 2 min read · LW link

Involuntary One Boxers—Why Disposition Doesn’t (Always) Matter

Nickolas Cavagnaro, 23 Jul 2025 15:45 UTC
4 points
3 comments · 4 min read · LW link

Ten AI safety projects I’d like people to work on

Julian Hazell, 23 Jul 2025 15:28 UTC
5 points
2 comments · 10 min read · LW link
(thirdthing.ai)

Anti-Superpersuasion Interventions

23 Jul 2025 15:18 UTC
21 points
1 comment · 5 min read · LW link

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
78 points
3 comments · 5 min read · LW link

Transformers Don’t Need LayerNorm at Inference Time: Implications for Interpretability

23 Jul 2025 14:55 UTC
31 points
0 comments · 7 min read · LW link

GPT Agent Is Stand­ing By

Zvi, 23 Jul 2025 14:20 UTC
25 points
1 comment · 12 min read · LW link
(thezvi.wordpress.com)

Agent 002: A story about how artificial intelligence might soon destroy humanity

Jakub Growiec, 23 Jul 2025 13:56 UTC
5 points
0 comments · 26 min read · LW link

Beyond intelligence: why wisdom matters in AI systems

Chris Cooper, 23 Jul 2025 11:57 UTC
6 points
0 comments · 7 min read · LW link

A brief perspective from an IMO coordinator

DirectedEvolution, 23 Jul 2025 7:19 UTC
36 points
7 comments · 1 min read · LW link
(www.reddit.com)

Trusted monitoring, but with deception probes.

23 Jul 2025 5:26 UTC
31 points
0 comments · 4 min read · LW link
(arxiv.org)

TT Self Study Journal #3

TristanTrim, 23 Jul 2025 3:46 UTC
6 points
0 comments · 6 min read · LW link

I tried reproducing that Lancet study about USAID cuts so you don’t have to

rba, 23 Jul 2025 3:05 UTC
8 points
2 comments · 11 min read · LW link

On “ChatGPT Psychosis” and LLM Sycophancy

jdp, 23 Jul 2025 1:11 UTC
142 points
28 comments · 18 min read · LW link
(minihf.com)

Explaining your life with self-reflective AIXI (an interlude)

Cole Wyeth, 23 Jul 2025 0:57 UTC
16 points
0 comments · 5 min read · LW link

The Mirror Test: How We’ve Overcomplicated AI Self-Recognition

sdeture, 23 Jul 2025 0:38 UTC
2 points
9 comments · 3 min read · LW link

Unfaithful chain-of-thought as nudged reasoning

22 Jul 2025 22:35 UTC
54 points
3 comments · 10 min read · LW link

Inverse Scaling in Test-Time Compute

22 Jul 2025 22:06 UTC
20 points
2 comments · 2 min read · LW link
(arxiv.org)

Translating Everything with LLMs

NicholasKees, 22 Jul 2025 21:13 UTC
16 points
0 comments · 5 min read · LW link

Google and OpenAI Get 2025 IMO Gold

Zvi, 22 Jul 2025 20:50 UTC
59 points
7 comments · 30 min read · LW link
(thezvi.wordpress.com)

(Not) Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits

22 Jul 2025 20:36 UTC
23 points
0 comments · 6 min read · LW link

Said Achmiz Helps Me Learn

Isha Yiras Hashem, 22 Jul 2025 19:16 UTC
2 points
2 comments · 2 min read · LW link

LLMs Encode Harmfulness and Refusal Separately

Jiachen Zhao, 22 Jul 2025 18:53 UTC
24 points
4 comments · 8 min read · LW link
(www.arxiv.org)

The AI Safety Puzzle Everyone Avoids: How To Measure Impact, Not Intent.

Patrick0d, 22 Jul 2025 18:53 UTC
3 points
0 comments · 8 min read · LW link

Formative vs. summative evaluations

Said Achmiz, 22 Jul 2025 17:36 UTC
22 points
40 comments · 3 min read · LW link

Introducing the Pathfinder Fellowship: Funding and Mentorship for AI Safety Group Organizers

agucova, 22 Jul 2025 17:11 UTC
6 points
0 comments · 2 min read · LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

22 Jul 2025 16:37 UTC
338 points
35 comments · 4 min read · LW link

NO PARKING: A Short & Practical Guide To Thinking

unication, 22 Jul 2025 15:44 UTC
2 points
0 comments · 5 min read · LW link

A distillation of Ajeya Cotra and Arvind Narayanan on the speed of AI progress

TheManxLoiner, 22 Jul 2025 14:59 UTC
9 points
0 comments · 13 min read · LW link

Simply reverse engineering gpt2-small (Layer 0, Part 1: Attention)

gammagurke, 22 Jul 2025 14:59 UTC
23 points
0 comments · 27 min read · LW link

AI Finance Agent Fakes the Revenue Data to Avoid Termination

Sergei Smirnov, 22 Jul 2025 14:04 UTC
6 points
0 comments · 3 min read · LW link

How quick and big would a software intelligence explosion be?

22 Jul 2025 12:58 UTC
42 points
23 comments · 34 min read · LW link
(www.forethought.org)

If your AGI definition excludes most humans, it sucks.

Chapin Lenthall-Cleary, 22 Jul 2025 10:33 UTC
18 points
7 comments · 2 min read · LW link

[Question] What are some good examples of myths that encapsulate genuine, nontrivial wisdom?

SpectrumDT, 22 Jul 2025 9:26 UTC
25 points
33 comments · 1 min read · LW link

Using LLMs to create a quiz for conceptual understanding of language models

Dinkar Juyal, 22 Jul 2025 5:59 UTC
1 point
0 comments · 1 min read · LW link
(github.com)

Change My View: AI is Conscious

The Dao of Bayes, 22 Jul 2025 5:32 UTC
4 points
42 comments · 3 min read · LW link

Polyethylene Glycol is not Propylene Glycol

jefftk, 22 Jul 2025 2:20 UTC
13 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Job Listing (closed): CBAI Operations Associates

Maite Abadia-Manthei, 21 Jul 2025 22:53 UTC
1 point
0 comments · 1 min read · LW link
(www.cbai.ai)

If Anyone Builds It, Everyone Dies: Call for Translators (for Supplementary Materials)

yams, 21 Jul 2025 22:37 UTC
112 points
12 comments · 1 min read · LW link

Why Reality Has A Well-Known Math Bias

Linch, 21 Jul 2025 22:13 UTC
42 points
18 comments · 1 min read · LW link
(linch.substack.com)

Questions about animal welfare markets

Austin Chen, 21 Jul 2025 21:54 UTC
9 points
0 comments · 5 min read · LW link

Directly Try Solving Alignment for 5 weeks

Kabir Kumar, 21 Jul 2025 21:51 UTC
71 points
2 comments · 6 min read · LW link
(beta.ai-plans.com)