- On second thought, prompt injections are probably examples of misalignment · lc · Feb 20, 2023, 11:56 PM · 22 points · 5 comments · 1 min read · LW link
- Nothing Is Ever Taught Correctly · LVSN · Feb 20, 2023, 10:31 PM · 5 points · 3 comments · 1 min read · LW link
- Behavioral and mechanistic definitions (often confuse AI alignment discussions) · LawrenceC · Feb 20, 2023, 9:33 PM · 33 points · 5 comments · 6 min read · LW link
- Validator models: A simple approach to detecting goodharting · beren · Feb 20, 2023, 9:32 PM · 14 points · 1 comment · 4 min read · LW link
- There are no coherence theorems · Dan H and EJT · Feb 20, 2023, 9:25 PM · 149 points · 130 comments · 19 min read · LW link · 1 review
- [Question] Are there any AI safety relevant fully remote roles suitable for someone with 2-3 years of machine learning engineering industry experience? · Malleable_shape · Feb 20, 2023, 7:57 PM · 7 points · 2 comments · 1 min read · LW link
- A circuit for Python docstrings in a 4-layer attention-only transformer · StefanHex and Jett Janiak · Feb 20, 2023, 7:35 PM · 96 points · 8 comments · 21 min read · LW link
- Sydney the Bingenator Can’t Think, But It Still Threatens People · Valentin Baltadzhiev · Feb 20, 2023, 6:37 PM · −3 points · 2 comments · 8 min read · LW link
- EIS IX: Interpretability and Adversaries · scasper · Feb 20, 2023, 6:25 PM · 30 points · 8 comments · 8 min read · LW link
- What AI companies can do today to help with the most important century · HoldenKarnofsky · Feb 20, 2023, 5:00 PM · 38 points · 3 comments · 9 min read · LW link (www.cold-takes.com)
- Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky · bayesed · Feb 20, 2023, 4:42 PM · 83 points · 54 comments · 1 min read · LW link (www.youtube.com)
- Speculative Technologies launch and Ben Reinhardt AMA · jasoncrawford · Feb 20, 2023, 4:33 PM · 16 points · 0 comments · 1 min read · LW link (rootsofprogress.org)
- [MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming · Dan H and TW123 · Feb 20, 2023, 3:54 PM · 20 points · 0 comments · 4 min read · LW link (newsletter.mlsafety.org)
- Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible? · Christopher King · Feb 20, 2023, 3:11 PM · 27 points · 15 comments · 1 min read · LW link
- Metaculus Introduces New ‘Conditional Pair’ Forecast Questions for Making Conditional Predictions · ChristianWilliams · Feb 20, 2023, 1:36 PM · 40 points · 0 comments · LW link
- On Investigating Conspiracy Theories · Zvi · Feb 20, 2023, 12:50 PM · 116 points · 38 comments · 5 min read · LW link (thezvi.wordpress.com)
- The Estimation Game: a monthly Fermi estimation web app · Sage Future and Adam B · Feb 20, 2023, 11:33 AM · 20 points · 2 comments · 1 min read · LW link
- The idea that ChatGPT is simply “predicting” the next word is, at best, misleading · Bill Benzon · Feb 20, 2023, 11:32 AM · 55 points · 88 comments · 5 min read · LW link
- Russell Conjugations list & voting thread · Daniel Kokotajlo · Feb 20, 2023, 6:39 AM · 23 points · 63 comments · 1 min read · LW link
- Emergent Deception and Emergent Optimization · jsteinhardt · Feb 20, 2023, 2:40 AM · 64 points · 0 comments · 14 min read · LW link (bounded-regret.ghost.io)
- AGI doesn’t need understanding, intention, or consciousness in order to kill us, only intelligence · James Blaha · Feb 20, 2023, 12:55 AM · 10 points · 2 comments · 18 min read · LW link
- Remote AI Alignment Overhang? · tryactions · Feb 19, 2023, 10:30 PM · 37 points · 5 comments · 4 min read · LW link
- A Neural Network undergoing Gradient-based Training as a Complex System · carboniferous_umbraculum · Feb 19, 2023, 10:08 PM · 22 points · 1 comment · 19 min read · LW link
- Another Way to Be Okay · Gretta Duleba · Feb 19, 2023, 8:49 PM · 107 points · 15 comments · 6 min read · LW link
- A Way To Be Okay · Duncan Sabien (Inactive) · Feb 19, 2023, 8:27 PM · 109 points · 38 comments · 10 min read · LW link · 1 review
- Exploring Lily’s world with ChatGPT [things an AI won’t do] · Bill Benzon · Feb 19, 2023, 4:39 PM · 5 points · 0 comments · 20 min read · LW link
- EIS VIII: An Engineer’s Understanding of Deceptive Alignment · scasper · Feb 19, 2023, 3:25 PM · 30 points · 5 comments · 4 min read · LW link
- Does novel understanding imply novel agency / values? · TsviBT · Feb 19, 2023, 2:41 PM · 18 points · 0 comments · 7 min read · LW link
- There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs · Taran · Feb 19, 2023, 12:25 PM · 125 points · 34 comments · 4 min read · LW link
- Navigating public AI x-risk hype while pursuing technical solutions · Dan Braun · Feb 19, 2023, 12:22 PM · 18 points · 0 comments · 2 min read · LW link
- Somewhat against “just update all the way” · tailcalled · Feb 19, 2023, 10:49 AM · 31 points · 10 comments · 2 min read · LW link
- Human beats SOTA Go AI by learning an adversarial policy · Vanessa Kosoy · Feb 19, 2023, 9:38 AM · 59 points · 32 comments · 1 min read · LW link (goattack.far.ai)
- Degamification · Nate Showell · Feb 19, 2023, 5:35 AM · 23 points · 2 comments · 2 min read · LW link
- Stop posting prompt injections on Twitter and calling it “misalignment” · lc · Feb 19, 2023, 2:21 AM · 144 points · 9 comments · 1 min read · LW link
- AGI in sight: our look at the game board · Andrea_Miotti and Gabriel Alfour · Feb 18, 2023, 10:17 PM · 227 points · 135 comments · 6 min read · LW link (andreamiotti.substack.com)
- We should be signal-boosting anti Bing chat content · mbrooks · Feb 18, 2023, 6:52 PM · −4 points · 13 comments · 2 min read · LW link
- Can talk, can think, can suffer. · Ilio · Feb 18, 2023, 6:43 PM · 1 point · 8 comments · 3 min read · LW link
- Parametrically retargetable decision-makers tend to seek power · TurnTrout · Feb 18, 2023, 6:41 PM · 172 points · 10 comments · 2 min read · LW link (arxiv.org)
- Near-Term Risks of an Obedient Artificial Intelligence · ymeskhout · Feb 18, 2023, 6:30 PM · 20 points · 1 comment · 6 min read · LW link
- EIS VII: A Challenge for Mechanists · scasper · Feb 18, 2023, 6:27 PM · 36 points · 4 comments · 3 min read · LW link
- Reading Speed Exists! · Johannes C. Mayer · Feb 18, 2023, 3:30 PM · 12 points · 9 comments · 1 min read · LW link
- The Practitioner’s Path 2.0: the Meditative Archetype · Evenflair · Feb 18, 2023, 3:23 PM · 14 points · 1 comment · 2 min read · LW link (guildoftherose.org)
- Should we cry “wolf”? · Tapatakt · Feb 18, 2023, 11:24 AM UTC · 24 points · 5 comments · 1 min read · LW link
- [Question] Name of the fallacy of assuming an extreme value (e.g. 0) with the illusion of ‘avoiding to have to make an assumption’? · FlorianH · Feb 18, 2023, 8:11 AM UTC · 4 points · 1 comment · 1 min read · LW link
- I Think We’re Approaching The Bitter Lesson’s Asymptote · SomeoneYouOnceKnew · Feb 18, 2023, 5:33 AM UTC · −3 points · 9 comments · 5 min read · LW link
- Bus-Only Bus Lane Enforcement · jefftk · Feb 18, 2023, 2:50 AM UTC · 19 points · 15 comments · 1 min read · LW link (www.jefftk.com)
- Run Head on Towards the Falling Tears · Johannes C. Mayer · Feb 18, 2023, 1:33 AM UTC · 6 points · 0 comments · 2 min read · LW link
- Two problems with ‘Simulators’ as a frame · ryan_greenblatt · Feb 17, 2023, 11:34 PM UTC · 79 points · 13 comments · 5 min read · LW link
- GPT-4 Predictions · Stephen McAleese · Feb 17, 2023, 11:20 PM UTC · 110 points · 27 comments · 11 min read · LW link
- On Board Vision, Hollow Words, and the End of the World · Marcello · Feb 17, 2023, 11:18 PM UTC · 52 points · 27 comments · 5 min read · LW link