Meta: Frontier AI Framework

Zach Stein-Perlman Feb 3, 2025, 10:00 PM

33 points

2 comments 1 min read LW link

(ai.meta.com)

$300 Fermi Model Competition

ozziegooen Feb 3, 2025, 7:47 PM

16 points

18 comments LW link

Visualizing Interpretability

Darold Davis Feb 3, 2025, 7:36 PM

2 points

0 comments 4 min read LW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan Henselmans Feb 3, 2025, 7:35 PM

16 points

7 comments 6 min read LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii Feb 3, 2025, 7:31 PM

2 points

0 comments 11 min read LW link

Sleeper agents appear resilient to activation steering

Lucy Wingard Feb 3, 2025, 7:31 PM

6 points

0 comments 7 min read LW link

Part 1: Enhancing Inner Alignment in CLIP Vision Transformers: Mitigating Reification Bias with SAEs and Grad ECLIP

Gilber A. Corrales Feb 3, 2025, 7:30 PM

1 point

0 comments 13 min read LW link

Superintelligence Alignment Proposal

Davey Morse Feb 3, 2025, 6:47 PM

5 points

3 comments 9 min read LW link

Gettier Cases [repost]

Antigone Feb 3, 2025, 6:12 PM

−4 points

5 comments 2 min read LW link

The Self-Reference Trap in Mathematics

Alister Munday Feb 3, 2025, 4:12 PM

−41 points

23 comments 2 min read LW link

Stopping unaligned LLMs is easy!

Yair Halberstadt Feb 3, 2025, 3:38 PM

−3 points

11 comments 2 min read LW link

The Outer Levels

Jerdle Feb 3, 2025, 2:30 PM

2 points

3 comments 6 min read LW link

o3-mini Early Days

Zvi Feb 3, 2025, 2:20 PM

45 points

0 comments 15 min read LW link

(thezvi.wordpress.com)

OpenAI releases deep research agent

Seth Herd Feb 3, 2025, 12:48 PM

78 points

21 comments 3 min read LW link

(openai.com)

Neuron Activations to CLIP Embeddings: Geometry of Linear Combinations in Latent Space

Roman Malov Feb 3, 2025, 10:30 AM

4 points

0 comments 2 min read LW link

[Question] Can we infer the search space of a local optimiser?

Lucius Bushnaq Feb 3, 2025, 10:17 AM

25 points

5 comments 3 min read LW link

Pick two: concise, comprehensive, or clear rules

Screwtape Feb 3, 2025, 6:39 AM

78 points

27 comments 8 min read LW link

Language Models and World Models, a Philosophy

kyjohnso Feb 3, 2025, 2:55 AM

1 point

0 comments 1 min read LW link

(hylaeansea.org)

Keeping Capital is the Challenge

LTM Feb 3, 2025, 2:04 AM

13 points

2 comments 17 min read LW link

(routecause.substack.com)

Use computers as powerful as in 1985 or AI controls humans or ?

jrincayc Feb 3, 2025, 12:51 AM

3 points

0 comments 2 min read LW link

Some Theses on Motivational and Directional Feedback

abstractapplic Feb 2, 2025, 10:50 PM

9 points

3 comments 4 min read LW link

Humanity Has A Possible 99.98% Chance Of Extinction

st3rlxx Feb 2, 2025, 9:46 PM

−12 points

1 comment 5 min read LW link

Exploring how OthelloGPT computes its world model

JMaar Feb 2, 2025, 9:29 PM

7 points

0 comments 8 min read LW link

An Introduction to Evidential Decision Theory

Babić Feb 2, 2025, 9:27 PM

5 points

2 comments 10 min read LW link

“DL training == human learning” is a bad analogy

kman Feb 2, 2025, 8:59 PM

3 points

0 comments 1 min read LW link

Conditional Importance in Toy Models of Superposition

james__p Feb 2, 2025, 8:35 PM

9 points

4 comments 10 min read LW link

Tracing Typos in LLMs: My Attempt at Understanding How Models Correct Misspellings

Ivan Dostal Feb 2, 2025, 7:56 PM

3 points

1 comment 5 min read LW link

The Simplest Good

Jesse Hoogland Feb 2, 2025, 7:51 PM

75 points

6 comments 5 min read LW link

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit Feb 2, 2025, 2:47 PM

129 points

36 comments 6 min read LW link

Thoughts on Toy Models of Superposition

james__p Feb 2, 2025, 1:52 PM

5 points

2 comments 9 min read LW link

Escape from Alderaan I

lsusr Feb 2, 2025, 10:48 AM

58 points

2 comments 6 min read LW link

ChatGPT: Exploring the Digital Wilderness, Findings and Prospects

Bill Benzon Feb 2, 2025, 9:54 AM

2 points

0 comments 5 min read LW link

[Question] Would anyone be interested in pursuing the Virtue of Scholarship with me?

japancolorado Feb 2, 2025, 4:02 AM

11 points

2 comments 1 min read LW link

Chinese room AI to survive the inescapable end of compute governance

rotatingpaguro Feb 2, 2025, 2:42 AM

−4 points

0 comments 11 min read LW link

Seasonal Patterns in BIDA’s Attendance

jefftk Feb 2, 2025, 2:40 AM

11 points

0 comments 2 min read LW link

(www.jefftk.com)

AI acceleration, DeepSeek, moral philosophy

Josh H Feb 2, 2025, 12:08 AM

2 points

0 comments 12 min read LW link

Falsehoods you might believe about people who are at a rationalist meetup

Screwtape Feb 1, 2025, 11:32 PM

60 points

12 comments 4 min read LW link

Interpreting autonomous driving agents with attention based architecture

Manav Dahra Feb 1, 2025, 11:20 PM

1 point

0 comments 11 min read LW link

Rationalist Movie Reviews

Nicholas / Heather Kross Feb 1, 2025, 11:10 PM

16 points

2 comments 4 min read LW link

(www.thinkingmuchbetter.com)

Retroactive If-Then Commitments

MichaelDickens Feb 1, 2025, 10:22 PM

7 points

0 comments 1 min read LW link

Exploring the coherence of features explanations in the GemmaScope

Mattia Proietti Feb 1, 2025, 9:28 PM

1 point

0 comments 19 min read LW link

Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model

Rudaiba Feb 1, 2025, 9:26 PM

9 points

2 comments 11 min read LW link

Towards a Science of Evals for Sycophancy

andrejfsantos Feb 1, 2025, 9:17 PM

7 points

0 comments 8 min read LW link

Post AGI effect prediction

Juliezhanggg Feb 1, 2025, 9:16 PM

1 point

0 comments 7 min read LW link

Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)

MiguelDev Feb 1, 2025, 7:17 PM

4 points

2 comments 2 min read LW link

(www.whitehatstoic.com)

Poetic Methods I: Meter as Communication Protocol

adamShimi Feb 1, 2025, 6:22 PM

19 points

0 comments 1 min read LW link

(formethods.substack.com)

Blackpool Applied Rationality Unconference 2025

Henry Prowbell and emily.fan

Feb 1, 2025, 2:09 PM

6 points

0 comments 7 min read LW link

[Question] How likely is an attempted coup in the United States in the next four years?

Alexander de Vries Feb 1, 2025, 1:12 PM

4 points

2 comments 1 min read LW link

Blackpool Applied Rationality Unconference 2025

Henry Prowbell and emily.fan

Feb 1, 2025, 1:04 PM

23 points

2 comments 7 min read LW link

One-dimensional vs multi-dimensional features in interpretability

charlieoneill Feb 1, 2025, 9:10 AM

6 points

0 comments 2 min read LW link