What I am working on right now and why: representation engineering edition

Lukasz G Bartoszcze · 18 Mar 2025 22:37 UTC
3 points
0 comments · 3 min read · LW link

Boots theory and Sybil Ramkin

philh · 18 Mar 2025 22:10 UTC
37 points
18 comments · 11 min read · LW link
(reasonableapproximation.net)

Schmidt Sciences Technical AI Safety RFP on Inference-Time Compute – Deadline: April 30

Ryan Gajarawala · 18 Mar 2025 18:05 UTC
18 points
0 comments · 2 min read · LW link
(www.schmidtsciences.org)

PRISM: Perspective Reasoning for Integrated Synthesis and Mediation (Interactive Demo)

Anthony Diamond · 18 Mar 2025 18:03 UTC
10 points
2 comments · 1 min read · LW link

Subspace Rerouting: Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Le magicien quantique · 18 Mar 2025 17:55 UTC
6 points
1 comment · 10 min read · LW link

Progress links and short notes, 2025-03-18

jasoncrawford · 18 Mar 2025 17:14 UTC
8 points
0 comments · 3 min read · LW link
(newsletter.rootsofprogress.org)

The Convergent Path to the Stars

Maxime Riché · 18 Mar 2025 17:09 UTC
6 points
0 comments · 20 min read · LW link

Sapir-Whorf Ego Death

Jonathan Moregård · 18 Mar 2025 16:57 UTC
8 points
7 comments · 2 min read · LW link
(honestliving.substack.com)

Smelling Nice is Good, Actually

Gordon Seidoh Worley · 18 Mar 2025 16:54 UTC
28 points
8 comments · 3 min read · LW link
(uncertainupdates.substack.com)

A Taxonomy of Jobs Deeply Resistant to TAI Automation

Deric Cheng · 18 Mar 2025 16:25 UTC
9 points
0 comments · 12 min read · LW link
(www.convergenceanalysis.org)

Why Are The Human Sciences Hard? Two New Hypotheses

18 Mar 2025 15:45 UTC
39 points
14 comments · 9 min read · LW link

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

18 Mar 2025 14:48 UTC
80 points
12 comments · 5 min read · LW link

[Question] What is the theory of change behind writing papers about AI safety?

Kajus · 18 Mar 2025 12:51 UTC
7 points
1 comment · 1 min read · LW link

OpenAI #11: America Action Plan

Zvi · 18 Mar 2025 12:50 UTC
83 points
3 comments · 6 min read · LW link
(thezvi.wordpress.com)

I changed my mind about orca intelligence

Towards_Keeperhood · 18 Mar 2025 10:15 UTC
54 points
24 comments · 5 min read · LW link

[Question] Is Peano arithmetic trying to kill us? Do we care?

Q Home · 18 Mar 2025 8:22 UTC
17 points
2 comments · 2 min read · LW link

Do What the Mammals Do

CrimsonChin · 18 Mar 2025 3:57 UTC
2 points
6 comments · 4 min read · LW link

What Actually Matters Until We Reach the Singularity

Lexius · 18 Mar 2025 2:17 UTC
−1 points
0 comments · 9 min read · LW link

Meaning as a cognitive substitute for survival instincts: A thought experiment

Ovidijus Šimkus · 18 Mar 2025 1:53 UTC
0 points
0 comments · 2 min read · LW link

Against Yudkowsky’s evolution analogy for AI x-risk [unfinished]

Fiora Sunshine · 18 Mar 2025 1:41 UTC
52 points
18 comments · 11 min read · LW link

An “AI researcher” has written a paper on optimizing AI architecture and optimized a language model to several orders of magnitude more efficiency.

Y B · 18 Mar 2025 1:15 UTC
3 points
1 comment · 1 min read · LW link

LessOnline 2025: Early Bird Tickets On Sale

Ben Pace · 18 Mar 2025 0:22 UTC
37 points
5 comments · 5 min read · LW link

Feedback loops for exercise (VO2Max)

Elizabeth · 18 Mar 2025 0:10 UTC
65 points
13 comments · 8 min read · LW link
(acesounderglass.com)

FrontierMath Score of o3-mini Much Lower Than Claimed

YafahEdelman · 17 Mar 2025 22:41 UTC
61 points
7 comments · 1 min read · LW link

Proof-of-Concept Debugger for a Small LLM

17 Mar 2025 22:27 UTC
27 points
0 comments · 11 min read · LW link

Effectively Communicating with DC Policymakers

PolicyTakes · 17 Mar 2025 22:11 UTC
14 points
0 comments · 2 min read · LW link

EIS XV: A New Proof of Concept for Useful Interpretability

scasper · 17 Mar 2025 20:05 UTC
30 points
2 comments · 3 min read · LW link

Sentinel’s Global Risks Weekly Roundup #11/2025. Trump invokes Alien Enemies Act, Chinese invasion barges deployed in exercise.

NunoSempere · 17 Mar 2025 19:34 UTC
59 points
3 comments · 6 min read · LW link
(blog.sentinel-team.org)

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

17 Mar 2025 19:11 UTC
188 points
9 comments · 6 min read · LW link

Three Types of Intelligence Explosion

17 Mar 2025 14:47 UTC
40 points
8 comments · 3 min read · LW link
(www.forethought.org)

An Advent of Thought

Kaarel · 17 Mar 2025 14:21 UTC
57 points
13 comments · 48 min read · LW link

Interested in working from a new Boston AI Safety Hub?

17 Mar 2025 13:42 UTC
17 points
0 comments · 2 min read · LW link

Other Civilizations Would Recover 84+% of Our Cosmic Resources—A Challenge to Extinction Risk Prioritization

Maxime Riché · 17 Mar 2025 13:12 UTC
5 points
0 comments · 12 min read · LW link

Monthly Roundup #28: March 2025

Zvi · 17 Mar 2025 12:50 UTC
31 points
8 comments · 14 min read · LW link
(thezvi.wordpress.com)

Are corporations superintelligent?

17 Mar 2025 10:36 UTC
3 points
3 comments · 1 min read · LW link
(aisafety.info)

One pager

samuelshadrach · 17 Mar 2025 8:12 UTC
6 points
2 comments · 8 min read · LW link
(samuelshadrach.com)

The Case for AI Optimism

Annapurna · 17 Mar 2025 1:29 UTC
−6 points
1 comment · 1 min read · LW link
(nationalaffairs.com)

Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

16 Mar 2025 23:23 UTC
45 points
8 comments · 13 min read · LW link

Read More News

utilistrutil · 16 Mar 2025 21:31 UTC
25 points
2 comments · 5 min read · LW link

What would a post labor economy *actually* look like?

Ansh Juneja · 16 Mar 2025 20:38 UTC
3 points
2 comments · 17 min read · LW link

Why White-Box Redteaming Makes Me Feel Weird

Zygi Straznickas · 16 Mar 2025 18:54 UTC
206 points
36 comments · 3 min read · LW link

How I’ve run major projects

benkuhn · 16 Mar 2025 18:40 UTC
127 points
10 comments · 8 min read · LW link
(www.benkuhn.net)

Counting Objections to Housing

jefftk · 16 Mar 2025 18:20 UTC
13 points
7 comments · 3 min read · LW link
(www.jefftk.com)

I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?

shrimpy · 16 Mar 2025 16:52 UTC
161 points
26 comments · 1 min read · LW link

Siberian Arctic origins of East Asian psychology

David Sun · 16 Mar 2025 16:52 UTC
6 points
0 comments · 1 min read · LW link

AI Model History is Being Lost

Vale · 16 Mar 2025 12:38 UTC
19 points
1 comment · 1 min read · LW link
(vale.rocks)

Metacognition Broke My Nail-Biting Habit

Rafka · 16 Mar 2025 12:36 UTC
45 points
20 comments · 2 min read · LW link

[Question] Can we ever ensure AI alignment if we can only test AI personas?

Karl von Wendt · 16 Mar 2025 8:06 UTC
22 points
8 comments · 1 min read · LW link

Can time preferences make AI safe?

TerriLeaf · 15 Mar 2025 21:41 UTC
2 points
1 comment · 2 min read · LW link

Help make the orca language experiment happen

Towards_Keeperhood · 15 Mar 2025 21:39 UTC
9 points
12 comments · 5 min read · LW link