The “no sandbagging on checkable tasks” hypothesis · Joe Carlsmith · Jul 31, 2023, 11:06 PM · 61 points · 14 comments · 9 min read · LW link
A Social History of Truth · Vaniver · Jul 31, 2023, 10:49 PM · 64 points · 2 comments · 14 min read · LW link
Watermarking considered overrated? · DanielFilan · Jul 31, 2023, 9:36 PM · 19 points · 4 comments · 1 min read · LW link
What The Lord of the Rings Teaches Us About AI Alignment · Jeffrey Heninger · Jul 31, 2023, 8:16 PM · 24 points · 12 comments · 7 min read · LW link
The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited · mwatkins · Jul 31, 2023, 7:47 PM · 85 points · 29 comments · 20 min read · LW link
“Building a House” Review · jefftk · Jul 31, 2023, 7:20 PM · 62 points · 6 comments · 1 min read · LW link · (www.jefftk.com)
The Meaning of Shoggoth AI Memes · Dan Smith · Jul 31, 2023, 6:52 PM · −5 points · 5 comments · 2 min read · LW link
[Question] Is there any existing term summarizing non-scalable oversight methods in outer alignment? · Allen Shen · Jul 31, 2023, 5:31 PM · 1 point · 0 comments · 1 min read · LW link
Lack of Social Grace Is an Epistemic Virtue · Zack_M_Davis · Jul 31, 2023, 4:38 PM · 41 points · 105 comments · 4 min read · LW link · 2 reviews
Thoughts on sharing information about language model capabilities · paulfchristiano · Jul 31, 2023, 4:04 PM · 210 points · 44 comments · 11 min read · LW link · 1 review
Trading off compute in training and inference (Overview) · Pablo Villalobos · Jul 31, 2023, 4:03 PM · 42 points · 2 comments · 7 min read · LW link · (epochai.org)
Open Problems and Fundamental Limitations of RLHF · scasper · Jul 31, 2023, 3:31 PM · 66 points · 6 comments · 2 min read · LW link · (arxiv.org)
“Not Necessarily” · Benjamin Hendricks · Jul 31, 2023, 3:19 PM · 24 points · 2 comments · 2 min read · LW link
How to find AI alignment researchers to collaborate with? · Florian Dietz · Jul 31, 2023, 9:05 AM · 2 points · 2 comments · 1 min read · LW link
[Question] Is Kennedy a Nazi? · Pee Doom · Jul 31, 2023, 8:51 AM · −12 points · 10 comments · 2 min read · LW link
Is Light Drinking Protective? · jefftk · Jul 31, 2023, 3:00 AM · 45 points · 8 comments · 2 min read · LW link · (www.jefftk.com)
EU’s AI ambitions at risk as US pushes to water down international treaty (linkpost) · mic · Jul 31, 2023, 12:34 AM · 10 points · 0 comments · 4 min read · LW link · (www.euractiv.com)
The rise of AI in cybercrime · BobyResearcher · Jul 30, 2023, 8:19 PM · −15 points · 1 comment · 2 min read · LW link · (riseofAIincybercryme)
SSA vs. SIA: how future population may provide evidence for or against the foundations of political liberalism · j · Jul 30, 2023, 8:18 PM · −6 points · 10 comments · 55 min read · LW link
Rationalization Maximizes Expected Value · Kevin Dorst · Jul 30, 2023, 8:11 PM · 19 points · 10 comments · 7 min read · LW link · (kevindorst.substack.com)
Apollo Neuro Results · Elizabeth · Jul 30, 2023, 6:40 PM · 85 points · 17 comments · 3 min read · LW link · (acesounderglass.com)
Hilbert’s Triumph, Church and Turing’s failure, and what it means (Post #2) · Noosphere89 · Jul 30, 2023, 2:33 PM · −5 points · 16 comments · 15 min read · LW link
[Question] Specific Arguments against open source LLMs? · Iknownothing · Jul 30, 2023, 2:27 PM · 4 points · 2 comments · 1 min read · LW link
Socialism in large organizations · Adam Zerner · Jul 30, 2023, 7:25 AM · 7 points · 16 comments · 2 min read · LW link
How to make real-money prediction markets on arbitrary topics (Outdated) · yutaka · Jul 30, 2023, 2:11 AM · 57 points · 13 comments · 3 min read · LW link
[Question] Does decidability of a theory imply completeness of the theory? · Noosphere89 · Jul 29, 2023, 11:53 PM · 6 points · 12 comments · 1 min read · LW link
[Question] If I showed the EQ-SQ theory’s findings to be due to measurement bias, would anyone change their minds about it? · tailcalled · Jul 29, 2023, 7:38 PM · 23 points · 13 comments · 1 min read · LW link
Self-driving car bets · paulfchristiano · Jul 29, 2023, 6:10 PM · 236 points · 44 comments · 5 min read · LW link · (sideways-view.com)
The Parable of the Dagger—The Animation · Writer · Jul 29, 2023, 2:03 PM · 20 points · 6 comments · 1 min read · LW link · (youtu.be)
Are Guitars Obsolete? · jefftk · Jul 29, 2023, 1:20 PM · 11 points · 8 comments · 2 min read · LW link · (www.jefftk.com)
NAMSI: A promising approach to alignment · Georgeo57 · Jul 29, 2023, 7:03 AM · −6 points · 6 comments · 1 min read · LW link
Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature · Claire Short · Jul 29, 2023, 6:10 AM · 27 points · 0 comments · 12 min read · LW link
Why You Should Never Update Your Beliefs · Arjun Panickssery · Jul 29, 2023, 12:27 AM · 76 points · 18 comments · 4 min read · LW link · 1 review · (arjunpanickssery.substack.com)
Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2) · RGRGRG · Jul 28, 2023, 8:44 PM · 24 points · 5 comments · 20 min read · LW link
Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic · ojorgensen · Jul 28, 2023, 7:43 PM · 13 points · 3 comments · 13 min read · LW link
When can we trust model evaluations? · evhub · Jul 28, 2023, 7:42 PM · 166 points · 10 comments · 10 min read · LW link · 1 review
Yes, It’s Subjective, But Why All The Crabs? · johnswentworth · Jul 28, 2023, 7:35 PM · 250 points · 15 comments · 6 min read · LW link
Semaglutide and Muscle · 5hout · Jul 28, 2023, 6:36 PM · 15 points · 14 comments · 5 min read · LW link
Double Crux in a Box · Screwtape · Jul 28, 2023, 5:55 PM · 8 points · 3 comments · 1 min read · LW link
Gradient descent might see the direction of the optimum from far away · Mikhail Samin · Jul 28, 2023, 4:19 PM · 70 points · 13 comments · 4 min read · LW link
Progress links digest, 2023-07-28: The decadent opulence of modern capitalism · jasoncrawford · Jul 28, 2023, 2:36 PM · 16 points · 3 comments · 3 min read · LW link · (rootsofprogress.org)
AI Awareness through Interaction with Blatantly Alien Models · VojtaKovarik · Jul 28, 2023, 8:41 AM · 7 points · 5 comments · 3 min read · LW link
You don’t get to have cool flaws · Neil · Jul 28, 2023, 5:37 AM · 78 points · 25 comments · 2 min read · LW link · 3 reviews
Reducing sycophancy and improving honesty via activation steering · Nina Panickssery · 28 Jul 2023 2:46 UTC · 122 points · 18 comments · 9 min read · LW link · 1 review
Mech Interp Puzzle 2: Word2Vec Style Embeddings · Neel Nanda · 28 Jul 2023 0:50 UTC · 41 points · 4 comments · 2 min read · LW link
ETFE windows · bhauth · 28 Jul 2023 0:46 UTC · 31 points · 4 comments · 2 min read · LW link · (www.bhauth.com)
A Short Memo on AI Interpretability Rainbows · scasper · 27 Jul 2023 23:05 UTC · 18 points · 0 comments · 2 min read · LW link
Pulling the Rope Sideways: Empirical Test Results · Daniel Kokotajlo · 27 Jul 2023 22:18 UTC · 61 points · 18 comments · 1 min read · LW link
A $10k retroactive grant for VaccinateCA · Austin Chen · 27 Jul 2023 18:14 UTC · 82 points · 0 comments · LW link · (manifund.org)
Preference Aggregation as Bayesian Inference · beren · 27 Jul 2023 17:59 UTC · 14 points · 1 comment · 1 min read · LW link