Nick Land: Orthogonality

lumpenspace · 4 Feb 2025 21:07 UTC
5 points
37 comments · 8 min read · LW link

What working on AI safety taught me about B2B SaaS sales

purple fire · 4 Feb 2025 20:50 UTC
7 points
12 comments · 5 min read · LW link

Subjective Naturalism in Decision Theory: Savage vs. Jeffrey–Bolker

4 Feb 2025 20:34 UTC
45 points
22 comments · 5 min read · LW link

Anti-Slop Interventions?

abramdemski · 4 Feb 2025 19:50 UTC
76 points
33 comments · 6 min read · LW link

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain · 4 Feb 2025 19:10 UTC
9 points
0 comments · 10 min read · LW link

[Question] Journalism student looking for sources

pinkerton · 4 Feb 2025 18:58 UTC
11 points
3 comments · 1 min read · LW link

We’re in Deep Research

Zvi · 4 Feb 2025 17:20 UTC
45 points
3 comments · 20 min read · LW link
(thezvi.wordpress.com)

The Capitalist Agent

henophilia · 4 Feb 2025 15:32 UTC
1 point
10 comments · 3 min read · LW link
(blog.hermesloom.org)

Forecasting AGI: Insights from Prediction Markets and Metaculus

Alvin Ånestrand · 4 Feb 2025 13:03 UTC
13 points
0 comments · 4 min read · LW link
(forecastingaifutures.substack.com)

Ruling Out Lookup Tables

Alfred Harwood · 4 Feb 2025 10:39 UTC
22 points
11 comments · 7 min read · LW link

Half-baked idea: a straightforward method for learning environmental goals?

Q Home · 4 Feb 2025 6:56 UTC
16 points
7 comments · 5 min read · LW link

Information Versus Action

Screwtape · 4 Feb 2025 5:13 UTC
27 points
0 comments · 6 min read · LW link

Utilitarian AI Alignment: Building a Moral Assistant with the Constitutional AI Method

Clément L · 4 Feb 2025 4:15 UTC
6 points
1 comment · 13 min read · LW link

Tear Down the Burren

jefftk · 4 Feb 2025 3:40 UTC
45 points
2 comments · 2 min read · LW link
(www.jefftk.com)

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · 4 Feb 2025 2:55 UTC
17 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Can someone, anyone, make superintelligence a more concrete concept?

Ori Nagel · 4 Feb 2025 2:18 UTC
2 points
8 comments · 5 min read · LW link

What are the “no free lunch” theorems?

4 Feb 2025 2:02 UTC
19 points
4 comments · 1 min read · LW link
(aisafety.info)

eliminating bias through language?

KvmanThinking · 4 Feb 2025 1:52 UTC
1 point
12 comments · 1 min read · LW link

New Foresight Longevity Bio & Molecular Nano Grants Program

Allison Duettmann · 4 Feb 2025 0:28 UTC
11 points
0 comments · 1 min read · LW link

Meta: Frontier AI Framework

Zach Stein-Perlman · 3 Feb 2025 22:00 UTC
33 points
2 comments · 1 min read · LW link
(ai.meta.com)

$300 Fermi Model Competition

ozziegooen · 3 Feb 2025 19:47 UTC
16 points
18 comments · 2 min read · LW link

Visualizing Interpretability

Darold Davis · 3 Feb 2025 19:36 UTC
2 points
0 comments · 4 min read · LW link

Alignment Can Reduce Performance on Simple Ethical Questions

Daan Henselmans · 3 Feb 2025 19:35 UTC
16 points
7 comments · 6 min read · LW link

The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)

Serhii Zamrii · 3 Feb 2025 19:31 UTC
2 points
0 comments · 11 min read · LW link

Sleeper agents appear resilient to activation steering

Lucy Wingard · 3 Feb 2025 19:31 UTC
6 points
0 comments · 7 min read · LW link

Part 1: Enhancing Inner Alignment in CLIP Vision Transformers: Mitigating Reification Bias with SAEs and Grad ECLIP

Gilber A. Corrales · 3 Feb 2025 19:30 UTC
1 point
0 comments · 13 min read · LW link

Superintelligence Alignment Proposal

Davey Morse · 3 Feb 2025 18:47 UTC
5 points
3 comments · 9 min read · LW link

The Self-Reference Trap in Mathematics

Alister Munday · 3 Feb 2025 16:12 UTC
−41 points
23 comments · 2 min read · LW link

Stopping unaligned LLMs is easy!

Yair Halberstadt · 3 Feb 2025 15:38 UTC
−3 points
11 comments · 2 min read · LW link

The Outer Levels

Jerdle · 3 Feb 2025 14:30 UTC
2 points
3 comments · 6 min read · LW link

o3-mini Early Days

Zvi · 3 Feb 2025 14:20 UTC
45 points
0 comments · 15 min read · LW link
(thezvi.wordpress.com)

OpenAI releases deep research agent

Seth Herd · 3 Feb 2025 12:48 UTC
78 points
21 comments · 3 min read · LW link
(openai.com)

Neuron Activations to CLIP Embeddings: Geometry of Linear Combinations in Latent Space

Roman Malov · 3 Feb 2025 10:30 UTC
5 points
0 comments · 2 min read · LW link

[Question] Can we infer the search space of a local optimiser?

Lucius Bushnaq · 3 Feb 2025 10:17 UTC
25 points
5 comments · 3 min read · LW link

Pick two: concise, comprehensive, or clear rules

Screwtape · 3 Feb 2025 6:39 UTC
82 points
27 comments · 8 min read · LW link

Language Models and World Models, a Philosophy

kyjohnso · 3 Feb 2025 2:55 UTC
1 point
0 comments · 1 min read · LW link
(hylaeansea.org)

Keeping Capital is the Challenge

LTM · 3 Feb 2025 2:04 UTC
13 points
2 comments · 17 min read · LW link
(routecause.substack.com)

Use computers as powerful as in 1985 or AI controls humans or ?

jrincayc · 3 Feb 2025 0:51 UTC
3 points
0 comments · 2 min read · LW link

Some Theses on Motivational and Directional Feedback

abstractapplic · 2 Feb 2025 22:50 UTC
10 points
3 comments · 4 min read · LW link

Humanity Has A Possible 99.98% Chance Of Extinction

st3rlxx · 2 Feb 2025 21:46 UTC
−12 points
1 comment · 5 min read · LW link

Exploring how OthelloGPT computes its world model

JMaar · 2 Feb 2025 21:29 UTC
8 points
0 comments · 8 min read · LW link

An Introduction to Evidential Decision Theory

Babić · 2 Feb 2025 21:27 UTC
5 points
2 comments · 10 min read · LW link

“DL training == human learning” is a bad analogy

kman · 2 Feb 2025 20:59 UTC
3 points
0 comments · 1 min read · LW link

Conditional Importance in Toy Models of Superposition

james__p · 2 Feb 2025 20:35 UTC
9 points
4 comments · 10 min read · LW link

Tracing Typos in LLMs: My Attempt at Understanding How Models Correct Misspellings

Ivan Dostal · 2 Feb 2025 19:56 UTC
4 points
1 comment · 5 min read · LW link

The Simplest Good

Jesse Hoogland · 2 Feb 2025 19:51 UTC
76 points
6 comments · 5 min read · LW link

Gradual Disempowerment, Shell Games and Flinches

Jan_Kulveit · 2 Feb 2025 14:47 UTC
133 points
36 comments · 6 min read · LW link

Thoughts on Toy Models of Superposition

james__p · 2 Feb 2025 13:52 UTC
5 points
2 comments · 9 min read · LW link

Escape from Alderaan I

lsusr · 2 Feb 2025 10:48 UTC
59 points
2 comments · 6 min read · LW link

ChatGPT: Exploring the Digital Wilderness, Findings and Prospects

Bill Benzon · 2 Feb 2025 9:54 UTC
2 points
0 comments · 5 min read · LW link