Archive
- My guess at Conjecture’s vision: triggering a narrative bifurcation · Alexandre Variengien · Feb 6, 2024, 7:10 PM · 75 points · 12 comments · 16 min read · LW link
- Arrogance and People Pleasing · Jonathan Moregård · Feb 6, 2024, 6:43 PM · 26 points · 7 comments · 4 min read · LW link (honestliving.substack.com)
- What does davidad want from «boundaries»? · Chipmonk and davidad · Feb 6, 2024, 5:45 PM · 47 points · 1 comment · 5 min read · LW link
- [Question] How can I efficiently read all the Dath Ilan worldbuilding? · mike_hawke · Feb 6, 2024, 4:52 PM · 10 points · 1 comment · 1 min read · LW link
- Preventing model exfiltration with upload limits · ryan_greenblatt · Feb 6, 2024, 4:29 PM · 71 points · 22 comments · 14 min read · LW link
- Evolution is an observation, not a process · Neil · Feb 6, 2024, 2:49 PM · 8 points · 11 comments · 5 min read · LW link
- [Question] Why do we need an understanding of the real world to predict the next tokens in a body of text? · Valentin Baltadzhiev · Feb 6, 2024, 2:43 PM · 2 points · 12 comments · 1 min read · LW link
- On the Debate Between Jezos and Leahy · Zvi · Feb 6, 2024, 2:40 PM · 64 points · 6 comments · 63 min read · LW link (thezvi.wordpress.com)
- Why Two Valid Answers Approach is not Enough for Sleeping Beauty · Ape in the coat · Feb 6, 2024, 2:21 PM · 6 points · 12 comments · 6 min read · LW link
- Are most personality disorders really trust disorders? · chaosmage · Feb 6, 2024, 12:37 PM · 20 points · 4 comments · 1 min read · LW link
- From Conceptual Spaces to Quantum Concepts: Formalising and Learning Structured Conceptual Models · Roman Leventov · Feb 6, 2024, 10:18 AM · 8 points · 1 comment · 4 min read · LW link (arxiv.org)
- Fluent dreaming for language models (AI interpretability method) · tbenthompson, mikes and Zygi Straznickas · Feb 6, 2024, 6:02 AM · 46 points · 5 comments · 1 min read · LW link (arxiv.org)
- Selfish AI Inevitable · Davey Morse · Feb 6, 2024, 4:29 AM · 1 point · 0 comments · 1 min read · LW link
- Toy models of AI control for concentrated catastrophe prevention · Fabien Roger and Buck · Feb 6, 2024, 1:38 AM · 51 points · 2 comments · 7 min read · LW link
- Things You’re Allowed to Do: University Edition · Saul Munn · Feb 6, 2024, 12:36 AM · 97 points · 13 comments · 5 min read · LW link (www.brasstacks.blog)
- Value learning in the absence of ground truth · Joel_Saarinen · Feb 5, 2024, 6:56 PM · 47 points · 8 comments · 45 min read · LW link
- Implementing activation steering · Annah · Feb 5, 2024, 5:51 PM · 75 points · 8 comments · 7 min read · LW link
- AI alignment as a translation problem · Roman Leventov · Feb 5, 2024, 2:14 PM · 22 points · 2 comments · 3 min read · LW link
- Safe Stasis Fallacy · Davidmanheim · Feb 5, 2024, 10:54 AM · 54 points · 2 comments · LW link
- [Question] How has internalising a post-AGI world affected your current choices? · yanni kyriacos · Feb 5, 2024, 5:43 AM · 10 points · 8 comments · 1 min read · LW link
- A thought experiment for comparing “biological” vs “digital” intelligence increase/explosion · Super AGI · Feb 5, 2024, 4:57 AM · 6 points · 3 comments · 1 min read · LW link
- Noticing Panic · Cole Wyeth · Feb 5, 2024, 3:45 AM · 59 points · 8 comments · 3 min read · LW link
- EA/ACX/LW February Santa Cruz Meetup · madmail · Feb 4, 2024, 11:26 PM · 1 point · 0 comments · 1 min read · LW link
- Vitalia Rationality Meetup · veronica · Feb 4, 2024, 7:46 PM · 1 point · 0 comments · 1 min read · LW link
- Personal predictions · Daniele De Nuntiis · Feb 4, 2024, 3:59 AM · 2 points · 2 comments · 3 min read · LW link
- A sketch of acausal trade in practice · Richard_Ngo · Feb 4, 2024, 12:32 AM · 36 points · 4 comments · 7 min read · LW link
- Brute Force Manufactured Consensus is Hiding the Crime of the Century · Roko · Feb 3, 2024, 8:36 PM · 209 points · 156 comments · 9 min read · LW link
- My thoughts on the Beff Jezos—Connor Leahy debate · kwiat.dev · Feb 3, 2024, 7:47 PM · −5 points · 23 comments · 4 min read · LW link
- The Journal of Dangerous Ideas · rogersbacon · Feb 3, 2024, 3:40 PM · −25 points · 4 comments · 5 min read · LW link (www.secretorum.life)
- Attitudes about Applied Rationality · Camille Berger · Feb 3, 2024, 2:42 PM · 108 points · 18 comments · 4 min read · LW link
- Practicing my Handwriting in 1439 · Maxwell Tabarrok · Feb 3, 2024, 1:21 PM · 11 points · 0 comments · 3 min read · LW link (www.maximum-progress.com)
- Finite Factored Sets to Bayes Nets Part 2 · J Bostock · Feb 3, 2024, 12:25 PM · 6 points · 0 comments · 8 min read · LW link
- Why I no longer identify as transhumanist · Kaj_Sotala · Feb 3, 2024, 12:00 PM · 55 points · 33 comments · 3 min read · LW link (kajsotala.fi)
- Attention SAEs Scale to GPT-2 Small · Connor Kissane, robertzk, Arthur Conmy and Neel Nanda · Feb 3, 2024, 6:50 AM · 78 points · 4 comments · 8 min read · LW link
- Why do we need RLHF? Imitation, Inverse RL, and the role of reward · Ran W · Feb 3, 2024, 4:00 AM · 16 points · 0 comments · 5 min read · LW link
- Announcing the London Initiative for Safe AI (LISA) · James Fox, mike_safeAI and Ryan Kidd · Feb 2, 2024, 11:17 PM · 98 points · 0 comments · 9 min read · LW link
- Survey for alignment researchers! · Cameron Berg, Judd Rosenblatt and AE Studio · Feb 2, 2024, 8:41 PM · 71 points · 11 comments · 1 min read · LW link
- Voting Results for the 2022 Review · Ben Pace · Feb 2, 2024, 8:34 PM · 57 points · 3 comments · 73 min read · LW link
- On Dwarkesh’s 3rd Podcast With Tyler Cowen · Zvi · Feb 2, 2024, 7:30 PM · 36 points · 9 comments · 21 min read · LW link (thezvi.wordpress.com)
- Most experts believe COVID-19 was probably not a lab leak · DanielFilan · Feb 2, 2024, 7:28 PM · 66 points · 89 comments · 2 min read · LW link (gcrinstitute.org)
- What Failure Looks Like is not an existential risk (and alignment is not the solution) · otto.barten · Feb 2, 2024, 6:59 PM · 13 points · 12 comments · 9 min read · LW link
- Solving alignment isn’t enough for a flourishing future · mic · Feb 2, 2024, 6:23 PM · 27 points · 0 comments · LW link (papers.ssrn.com)
- Manifold Markets · PeterMcCluskey · Feb 2, 2024, 5:48 PM · 26 points · 9 comments · 4 min read · LW link (bayesianinvestor.com)
- Types of subjective welfare · MichaelStJules · Feb 2, 2024, 9:56 AM · 10 points · 3 comments · LW link
- Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small · Joseph Bloom · Feb 2, 2024, 6:54 AM · 103 points · 37 comments · 15 min read · LW link
- Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities · porby · Feb 2, 2024, 5:49 AM · 47 points · 1 comment · 4 min read · LW link (arxiv.org)
- Running a Prediction Market Mafia Game · Arjun Panickssery · Feb 1, 2024, 11:24 PM · 22 points · 5 comments · 1 min read · LW link (arjunpanickssery.substack.com)
- Evaluating Stability of Unreflective Alignment · james.lucassen · Feb 1, 2024, 10:15 PM · 57 points · 12 comments · 18 min read · LW link (jlucassen.com)
- Davidad’s Provably Safe AI Architecture—ARIA’s Programme Thesis · simeon_c · Feb 1, 2024, 9:30 PM · 69 points · 17 comments · 1 min read · LW link (www.aria.org.uk)
- Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis · RogerDearnaley · Feb 1, 2024, 9:15 PM · 16 points · 15 comments · 13 min read · LW link