Meta: Frontier AI Framework
Zach Stein-Perlman
Feb 3, 2025, 10:00 PM
33
points
2
comments
1
min read
LW
link
(ai.meta.com)
$300 Fermi Model Competition
ozziegooen
Feb 3, 2025, 7:47 PM
16
points
18
comments
LW
link
Visualizing Interpretability
Darold Davis
Feb 3, 2025, 7:36 PM
2
points
0
comments
4
min read
LW
link
Alignment Can Reduce Performance on Simple Ethical Questions
Daan Henselmans
Feb 3, 2025, 7:35 PM
16
points
7
comments
6
min read
LW
link
The Overlap Paradigm: Rethinking Data’s Role in Weak-to-Strong Generalization (W2SG)
Serhii Zamrii
Feb 3, 2025, 7:31 PM
2
points
0
comments
11
min read
LW
link
Sleeper agents appear resilient to activation steering
Lucy Wingard
Feb 3, 2025, 7:31 PM
6
points
0
comments
7
min read
LW
link
Part 1: Enhancing Inner Alignment in CLIP Vision Transformers: Mitigating Reification Bias with SAEs and Grad ECLIP
Gilber A. Corrales
Feb 3, 2025, 7:30 PM
1
point
0
comments
13
min read
LW
link
Superintelligence Alignment Proposal
Davey Morse
Feb 3, 2025, 6:47 PM
5
points
3
comments
9
min read
LW
link
Gettier Cases [repost]
Antigone
Feb 3, 2025, 6:12 PM
−4
points
5
comments
2
min read
LW
link
The Self-Reference Trap in Mathematics
Alister Munday
Feb 3, 2025, 4:12 PM
−41
points
23
comments
2
min read
LW
link
Stopping unaligned LLMs is easy!
Yair Halberstadt
Feb 3, 2025, 3:38 PM
−3
points
11
comments
2
min read
LW
link
The Outer Levels
Jerdle
Feb 3, 2025, 2:30 PM
2
points
3
comments
6
min read
LW
link
o3-mini Early Days
Zvi
Feb 3, 2025, 2:20 PM
45
points
0
comments
15
min read
LW
link
(thezvi.wordpress.com)
OpenAI releases deep research agent
Seth Herd
Feb 3, 2025, 12:48 PM
78
points
21
comments
3
min read
LW
link
(openai.com)
Neuron Activations to CLIP Embeddings: Geometry of Linear Combinations in Latent Space
Roman Malov
Feb 3, 2025, 10:30 AM
4
points
0
comments
2
min read
LW
link
[Question]
Can we infer the search space of a local optimiser?
Lucius Bushnaq
Feb 3, 2025, 10:17 AM
25
points
5
comments
3
min read
LW
link
Pick two: concise, comprehensive, or clear rules
Screwtape
Feb 3, 2025, 6:39 AM
78
points
27
comments
8
min read
LW
link
Language Models and World Models, a Philosophy
kyjohnso
Feb 3, 2025, 2:55 AM
1
point
0
comments
1
min read
LW
link
(hylaeansea.org)
Keeping Capital is the Challenge
LTM
Feb 3, 2025, 2:04 AM
13
points
2
comments
17
min read
LW
link
(routecause.substack.com)
Use computers as powerful as in 1985 or AI controls humans or ?
jrincayc
Feb 3, 2025, 12:51 AM
3
points
0
comments
2
min read
LW
link
Some Theses on Motivational and Directional Feedback
abstractapplic
Feb 2, 2025, 10:50 PM
9
points
3
comments
4
min read
LW
link
Humanity Has A Possible 99.98% Chance Of Extinction
st3rlxx
Feb 2, 2025, 9:46 PM
−12
points
1
comment
5
min read
LW
link
Exploring how OthelloGPT computes its world model
JMaar
Feb 2, 2025, 9:29 PM
7
points
0
comments
8
min read
LW
link
An Introduction to Evidential Decision Theory
Babić
Feb 2, 2025, 9:27 PM
5
points
2
comments
10
min read
LW
link
“DL training == human learning” is a bad analogy
kman
Feb 2, 2025, 8:59 PM
3
points
0
comments
1
min read
LW
link
Conditional Importance in Toy Models of Superposition
james__p
Feb 2, 2025, 8:35 PM
9
points
4
comments
10
min read
LW
link
Tracing Typos in LLMs: My Attempt at Understanding How Models Correct Misspellings
Ivan Dostal
Feb 2, 2025, 7:56 PM
3
points
1
comment
5
min read
LW
link
The Simplest Good
Jesse Hoogland
Feb 2, 2025, 7:51 PM
75
points
6
comments
5
min read
LW
link
Gradual Disempowerment, Shell Games and Flinches
Jan_Kulveit
Feb 2, 2025, 2:47 PM
129
points
36
comments
6
min read
LW
link
Thoughts on Toy Models of Superposition
james__p
Feb 2, 2025, 1:52 PM
5
points
2
comments
9
min read
LW
link
Escape from Alderaan I
lsusr
Feb 2, 2025, 10:48 AM
58
points
2
comments
6
min read
LW
link
ChatGPT: Exploring the Digital Wilderness, Findings and Prospects
Bill Benzon
Feb 2, 2025, 9:54 AM
2
points
0
comments
5
min read
LW
link
[Question]
Would anyone be interested in pursuing the Virtue of Scholarship with me?
japancolorado
Feb 2, 2025, 4:02 AM
11
points
2
comments
1
min read
LW
link
Chinese room AI to survive the inescapable end of compute governance
rotatingpaguro
Feb 2, 2025, 2:42 AM
−4
points
0
comments
11
min read
LW
link
Seasonal Patterns in BIDA’s Attendance
jefftk
Feb 2, 2025, 2:40 AM
11
points
0
comments
2
min read
LW
link
(www.jefftk.com)
AI acceleration, DeepSeek, moral philosophy
Josh H
Feb 2, 2025, 12:08 AM
2
points
0
comments
12
min read
LW
link
Falsehoods you might believe about people who are at a rationalist meetup
Screwtape
Feb 1, 2025, 11:32 PM
60
points
12
comments
4
min read
LW
link
Interpreting autonomous driving agents with attention based architecture
Manav Dahra
Feb 1, 2025, 11:20 PM
1
point
0
comments
11
min read
LW
link
Rationalist Movie Reviews
Nicholas / Heather Kross
Feb 1, 2025, 11:10 PM
16
points
2
comments
4
min read
LW
link
(www.thinkingmuchbetter.com)
Retroactive If-Then Commitments
MichaelDickens
Feb 1, 2025, 10:22 PM
7
points
0
comments
1
min read
LW
link
Exploring the coherence of features explanations in the GemmaScope
Mattia Proietti
Feb 1, 2025, 9:28 PM
1
point
0
comments
19
min read
LW
link
Machine Unlearning in Large Language Models: A Comprehensive Survey with Empirical Insights from the Qwen 1.5 1.8B Model
Rudaiba
Feb 1, 2025, 9:26 PM
9
points
2
comments
11
min read
LW
link
Towards a Science of Evals for Sycophancy
andrejfsantos
Feb 1, 2025, 9:17 PM
7
points
0
comments
8
min read
LW
link
Post AGI effect prediction
Juliezhanggg
Feb 1, 2025, 9:16 PM
1
point
0
comments
7
min read
LW
link
Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)
MiguelDev
Feb 1, 2025, 7:17 PM
4
points
2
comments
2
min read
LW
link
(www.whitehatstoic.com)
Poetic Methods I: Meter as Communication Protocol
adamShimi
Feb 1, 2025, 6:22 PM
19
points
0
comments
1
min read
LW
link
(formethods.substack.com)
Blackpool Applied Rationality Unconference 2025
Henry Prowbell
and
emily.fan
Feb 1, 2025, 2:09 PM
6
points
0
comments
7
min read
LW
link
[Question]
How likely is an attempted coup in the United States in the next four years?
Alexander de Vries
Feb 1, 2025, 1:12 PM
4
points
2
comments
1
min read
LW
link
Blackpool Applied Rationality Unconference 2025
Henry Prowbell
and
emily.fan
Feb 1, 2025, 1:04 PM
23
points
2
comments
7
min read
LW
link
One-dimensional vs multi-dimensional features in interpretability
charlieoneill
Feb 1, 2025, 9:10 AM
6
points
0
comments
2
min read
LW
link