Freedom and Privacy of Thought Architectures
SebastianG
Jul 20, 2024, 9:43 PM
5
points
2
comments
1
min read
LW
link
Introduction to Modern Dating: Strategic Dating Advice for beginners
Jesper Lindholm
Jul 20, 2024, 3:45 PM
6
points
6
comments
13
min read
LW
link
Why Georgism Lost Its Popularity
Zero Contradictions
Jul 20, 2024, 3:08 PM
45
points
54
comments
1
min read
LW
link
(zerocontradictions.net)
Only Fools Avoid Hindsight Bias
Kevin Dorst
Jul 20, 2024, 1:42 PM
−11
points
5
comments
6
min read
LW
link
(kevindorst.substack.com)
A more systematic case for inner misalignment
Richard_Ngo
Jul 20, 2024, 5:03 AM
31
points
4
comments
5
min read
LW
link
BatchTopK: A Simple Improvement for TopK-SAEs
Bart Bussmann, Patrick Leask and Neel Nanda
Jul 20, 2024, 2:20 AM
61
points
0
comments
4
min read
LW
link
Krona Compare
jefftk
Jul 20, 2024, 1:10 AM
10
points
0
comments
2
min read
LW
link
(www.jefftk.com)
(Approximately) Deterministic Natural Latents
johnswentworth and David Lorell
Jul 19, 2024, 11:02 PM
42
points
1
comment
4
min read
LW
link
Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions
Lidor Banuel Dabbah and Aviel Boag
Jul 19, 2024, 8:32 PM
59
points
6
comments
16
min read
LW
link
JumpReLU SAEs + Early Access to Gemma 2 SAEs
Senthooran Rajamanoharan, Tom Lieberum, nps29, Arthur Conmy, Vikrant Varma, János Kramár and Neel Nanda
Jul 19, 2024, 4:10 PM
49
points
10
comments
1
min read
LW
link
(storage.googleapis.com)
Truth is Universal: Robust Detection of Lies in LLMs
Lennart Buerger
Jul 19, 2024, 2:07 PM
24
points
3
comments
2
min read
LW
link
(arxiv.org)
Sustainability of Digital Life Form Societies
Hiroshi Yamakawa
Jul 19, 2024, 1:59 PM
19
points
1
comment
20
min read
LW
link
Romae Industriae
Maxwell Tabarrok
Jul 19, 2024, 1:03 PM
34
points
2
comments
7
min read
LW
link
(www.maximum-progress.com)
[Question]
Have people given up on iterated distillation and amplification?
Chris_Leong
Jul 19, 2024, 12:23 PM
20
points
1
comment
1
min read
LW
link
How do we know that “good research” is good? (aka “direct evaluation” vs “eigen-evaluation”)
Ruby
Jul 19, 2024, 12:31 AM
49
points
21
comments
6
min read
LW
link
Linkpost: Surely you can be serious
kave
Jul 18, 2024, 10:18 PM
62
points
8
comments
1
min read
LW
link
(www.experimental-history.com)
My experience applying to MATS 6.0
mic
Jul 18, 2024, 7:02 PM
17
points
3
comments
5
min read
LW
link
[Question]
What are the actual arguments in favor of computationalism as a theory of identity?
sunwillrise
Jul 18, 2024, 6:44 PM
12
points
26
comments
5
min read
LW
link
Yet Another Critique of “Luxury Beliefs”
ymeskhout
Jul 18, 2024, 6:37 PM
6
points
10
comments
9
min read
LW
link
(www.ymeskhout.com)
[Interim research report] Evaluating the Goal-Directedness of Language Models
Rauno Arike, Elizabeth Donoway and Marius Hobbhahn
Jul 18, 2024, 6:19 PM
40
points
4
comments
11
min read
LW
link
Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent
Karolis Jucys, george_adams and Sonia Joseph
Jul 18, 2024, 5:02 PM
9
points
0
comments
1
min read
LW
link
(arxiv.org)
Activation Engineering Theories of Impact
kubanetics
Jul 18, 2024, 4:44 PM
6
points
1
comment
2
min read
LW
link
[Question]
Me & My Clone
SimonBaars
Jul 18, 2024, 4:25 PM
27
points
22
comments
1
min read
LW
link
AI #73: Openly Evil AI
Zvi
Jul 18, 2024, 2:40 PM
89
points
20
comments
52
min read
LW
link
(thezvi.wordpress.com)
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex and Nicholas Goldowsky-Dill
Jul 18, 2024, 2:15 PM
122
points
18
comments
18
min read
LW
link
SAEs (usually) Transfer Between Base and Chat Models
Connor Kissane, robertzk, Arthur Conmy and Neel Nanda
Jul 18, 2024, 10:29 AM
67
points
0
comments
10
min read
LW
link
[Question]
Should we exclude alignment research from LLM training datasets?
Ben Millwood
Jul 18, 2024, 10:27 AM
3
points
5
comments
1
min read
LW
link
Keeping content out of LLM training datasets
Ben Millwood
Jul 18, 2024, 10:27 AM
3
points
0
comments
5
min read
LW
link
The Assassination of Trump’s Ear is Evidence for Time-Travel
elv
Jul 18, 2024, 7:01 AM
−9
points
5
comments
5
min read
LW
link
Friendship is transactional, unconditional friendship is insurance
Ruby
Jul 17, 2024, 10:52 PM
67
points
24
comments
2
min read
LW
link
D&D.Sci: Whom Shall You Call? [Evaluation and Ruleset]
abstractapplic
Jul 17, 2024, 10:34 PM
17
points
5
comments
5
min read
LW
link
Optimistic Assumptions, Longterm Planning, and “Cope”
Raemon
Jul 17, 2024, 10:14 PM
215
points
46
comments
7
min read
LW
link
Baking vs Patissing vs Cooking, the HPS explanation
adamShimi
Jul 17, 2024, 8:29 PM
30
points
16
comments
3
min read
LW
link
(epistemologicalfascinations.substack.com)
Launching the Respiratory Outlook 2024/25 Forecasting Series
ChristianWilliams
Jul 17, 2024, 7:51 PM
5
points
0
comments
LW
link
(www.metaculus.com)
What are you getting paid in?
Austin Chen
Jul 17, 2024, 7:23 PM
92
points
14
comments
4
min read
LW
link
(www.approachwithalacrity.com)
Individually incentivized safe Pareto improvements in open-source bargaining
Nicolas Macé, Anthony DiGiovanni and JesseClifton
Jul 17, 2024, 6:26 PM
41
points
2
comments
17
min read
LW
link
Profit and Value
kwang
Jul 17, 2024, 6:06 PM
22
points
3
comments
6
min read
LW
link
(open.substack.com)
So You’ve Learned To Teleport by Tom Scott
landscape_kiwi
Jul 17, 2024, 6:04 PM
4
points
0
comments
1
min read
LW
link
(www.youtube.com)
How does generalized accessibility compare to targeted accessibility?
ErioirE
Jul 17, 2024, 5:07 PM
3
points
0
comments
2
min read
LW
link
Housing Roundup #9: Restricting Supply
Zvi
Jul 17, 2024, 12:50 PM
25
points
8
comments
44
min read
LW
link
(thezvi.wordpress.com)
We ran an AI safety conference in Tokyo. It went really well. Come next year!
Blaine
Jul 17, 2024, 6:55 AM
45
points
1
comment
6
min read
LW
link
Agency in Politics
Martin Sustrik
Jul 17, 2024, 5:30 AM
35
points
2
comments
3
min read
LW
link
(250bpm.substack.com)
Arrakis—A toolkit to conduct, track and visualize mechanistic interpretability experiments.
Yash Srivastava
Jul 17, 2024, 2:02 AM
3
points
2
comments
5
min read
LW
link
Announcing Open Philanthropy’s AI governance and policy RFP
Julian Hazell
Jul 17, 2024, 2:02 AM
25
points
0
comments
1
min read
LW
link
(www.openphilanthropy.org)
Turning Your Back On Traffic
jefftk
Jul 17, 2024, 1:00 AM
37
points
7
comments
1
min read
LW
link
(www.jefftk.com)
[Question]
Opinions on Eureka Labs
jmh
Jul 17, 2024, 12:16 AM
6
points
2
comments
1
min read
LW
link
Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural
Rubi J. Hudson
16 Jul 2024 22:44 UTC
44
points
27
comments
5
min read
LW
link
Multiplex Gene Editing: Where Are We Now?
sarahconstantin
16 Jul 2024 20:50 UTC
73
points
6
comments
7
min read
LW
link
(sarahconstantin.substack.com)
Recursion in AI is scary. But let’s talk solutions.
Oleg Trott
16 Jul 2024 20:34 UTC
3
points
10
comments
2
min read
LW
link
How to wash your hands precisely and thoroughly
dkl9
16 Jul 2024 18:29 UTC
12
points
0
comments
1
min read
LW
link
(dkl9.net)