LessWrong — Archive
Jacob on the Precipice
Richard_Ngo · Sep 26, 2023, 9:16 PM · 45 points · 8 comments · 11 min read · (narrativeark.substack.com)

Text Posts from the Kids Group: 2022
jefftk · Sep 26, 2023, 8:40 PM · 33 points · 2 comments · 7 min read · (www.jefftk.com)

GPT-4 for personal productivity: online distraction blocker
Sergii · Sep 26, 2023, 5:41 PM · 65 points · 13 comments · 2 min read · (grgv.xyz)

ARENA 2.0 - Impact Report
CallumMcDougall · Sep 26, 2023, 5:13 PM · 35 points · 5 comments · 13 min read
Mechanistic Interpretability Reading group
1stuserhere and woog · Sep 26, 2023, 4:26 PM · 15 points · 0 comments · 1 min read

Announcing the CNN Interpretability Competition
scasper · Sep 26, 2023, 4:21 PM · 22 points · 0 comments · 4 min read

Making AIs less likely to be spiteful
Nicolas Macé, Anthony DiGiovanni and JesseClifton · Sep 26, 2023, 2:12 PM · 118 points · 7 comments · 10 min read

[Linkpost] Mark Zuckerberg confronted about Meta’s Llama 2 AI’s ability to give users detailed guidance on making anthrax—Business Insider
mic · Sep 26, 2023, 12:05 PM · 18 points · 11 comments · 2 min read · (www.businessinsider.com)

Enforcing Far-Future Contracts for Governments
FCCC · Sep 26, 2023, 4:26 AM · −7 points · 49 comments · 3 min read

Carioca Petrov Day
Giskard · Sep 26, 2023, 12:30 AM · 1 point · 0 comments · 1 min read
[Question] A few Alignment questions: utility optimizers, SLT, sharp left turn and identifiability
Igor Timofeev · Sep 26, 2023, 12:27 AM · 6 points · 1 comment · 2 min read

Impact stories for model internals: an exercise for interpretability researchers
jenny · Sep 25, 2023, 11:15 PM · 29 points · 3 comments · 7 min read

Autonomic Sanity
Sable · Sep 25, 2023, 10:37 PM · 20 points · 9 comments · 4 min read · (affablyevil.substack.com)

[Question] What is wrong with this “utility switch button problem” approach?
Donald Hobson · Sep 25, 2023, 9:36 PM · 14 points · 3 comments · 1 min read

You should just smile at strangers a lot
chaosmage · Sep 25, 2023, 8:12 PM · 14 points · 10 comments · 1 min read

The King and the Golem
Richard_Ngo · Sep 25, 2023, 7:51 PM · 190 points · 19 comments · 5 min read · 1 review · (narrativeark.substack.com)
Public Opinion on AI Safety: AIMS 2023 and 2021 Summary
Jacy Reese Anthis, Janet Pauketat and Ali · Sep 25, 2023, 6:55 PM · 3 points · 2 comments · 3 min read · (www.sentienceinstitute.org)

Welcome to Apply: The 2024 Vitalik Buterin Fellowships in AI Existential Safety by FLI!
Zhijing Jin · Sep 25, 2023, 6:42 PM · 5 points · 2 comments · 2 min read

Evaluating hidden directions on the utility dataset: classification, steering and removal
Annah and shash42 · Sep 25, 2023, 5:19 PM · 25 points · 3 comments · 7 min read

Linkpost: A model of biases as arising from meta-beliefs
JuanGarcia · Sep 25, 2023, 5:14 PM · 5 points · 0 comments · 1 min read

[Question] What causes a decision theory to be used?
Dagon · Sep 25, 2023, 4:33 PM · 8 points · 2 comments · 1 min read

Understanding strategic deception and deceptive alignment
Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and Dan Braun · Sep 25, 2023, 4:27 PM · 64 points · 16 comments · 7 min read · (www.apolloresearch.ai)
The Merits of Contrarianism & Why I hate Chatbots. [My Experience with the Ideological Turing Test @ a Less Wrong meetup]
Amina V. · Sep 25, 2023, 4:13 PM · 4 points · 1 comment · 1 min read · (bimbollectual.com)

Inside Views, Impostor Syndrome, and the Great LARP
johnswentworth · Sep 25, 2023, 4:08 PM · 336 points · 53 comments · 5 min read

“X distracts from Y” as a thinly-disguised fight over group status / politics
Steven Byrnes · Sep 25, 2023, 3:18 PM · 112 points · 14 comments · 8 min read

Amazon to invest up to $4 billion in Anthropic
Davis_Kingsley · Sep 25, 2023, 2:55 PM · 44 points · 8 comments · (twitter.com)

Should Effective Altruists be Valuists instead of utilitarians?
spencerg and AmberDawn · Sep 25, 2023, 2:03 PM · 1 point · 3 comments · 6 min read

Feedly Breaks MathML
jefftk · Sep 25, 2023, 1:40 PM · 15 points · 3 comments · 1 min read · (www.jefftk.com)

[Question] How have you become more hard-working?
Chi Nguyen · Sep 25, 2023, 12:37 PM · 82 points · 42 comments
Automating Intelligence: A Cursory Glance at How AutoML Brings Precision to AI Development
RoscoHunter · Sep 25, 2023, 9:39 AM · 3 points · 0 comments · 3 min read

Interpreting OpenAI’s Whisper
EllenaR · Sep 24, 2023, 5:53 PM · 116 points · 13 comments · 7 min read

Contradiction Appeal Bias
onur · Sep 24, 2023, 5:03 PM · 3 points · 2 comments · 1 min read

RAIN: Your Language Models Can Align Themselves without Finetuning—Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!
Singularian2501 · Sep 24, 2023, 4:48 PM · 5 points · 0 comments · 1 min read

Honor System for Vaccination?
jefftk · Sep 24, 2023, 11:50 AM · 17 points · 22 comments · 1 min read · (www.jefftk.com)

Far-Future Commitments as a Policy Consensus Strategy
FCCC · Sep 24, 2023, 6:34 AM · 7 points · 40 comments · 1 min read

Five neglected work areas that could reduce AI risk
CharlotteS and Aaron_Scher · Sep 24, 2023, 2:03 AM · 17 points · 5 comments · 9 min read
[Question] Are the other Rationality: A-Z sequences coming out as books?
caffeinated_dissonance · Sep 24, 2023, 12:38 AM · 7 points · 4 comments · 1 min read

The Dick Kick’em Paradox
Augs SMSHacks · Sep 23, 2023, 10:22 PM · −5 points · 21 comments · 1 min read

I designed an AI safety course (for a philosophy department)
Eleni Angelou · Sep 23, 2023, 10:03 PM · 37 points · 15 comments · 2 min read

Paper: LLMs trained on “A is B” fail to learn “B is A”
lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak · Sep 23, 2023, 7:55 PM · 121 points · 74 comments · 4 min read · (arxiv.org)

Sparse Coding, for Mechanistic Interpretability and Activation Engineering
David Udell · Sep 23, 2023, 7:16 PM · 42 points · 7 comments · 34 min read
[Question] Places to meet interesting middle-aged men?
anon_girl · Sep 23, 2023, 7:06 PM (UTC) · 18 points · 6 comments · 1 min read

Taking features out of superposition with sparse autoencoders more quickly with informed initialization
Pierre Peigné · Sep 23, 2023, 4:21 PM (UTC) · 30 points · 8 comments · 5 min read

A quick remark on so-called “hallucinations” in LLMs and humans
Bill Benzon · Sep 23, 2023, 12:17 PM (UTC) · 4 points · 4 comments · 1 min read

Hand-writing MathML
jefftk · Sep 23, 2023, 11:20 AM (UTC) · 16 points · 40 comments · 1 min read · (www.jefftk.com)

Musk, Starlink, and Crimea
Nicholas / Heather Kross · Sep 23, 2023, 2:35 AM (UTC) · −13 points · 0 comments · 5 min read

[Linkpost/Video] All The Times We Nearly Blew Up The World
Jacob G-W · Sep 23, 2023, 1:18 AM (UTC) · 6 points · 1 comment · 1 min read · (www.youtube.com)

Luck based medicine: inositol for anxiety and brain fog
Elizabeth · Sep 22, 2023, 8:10 PM (UTC) · 40 points · 5 comments · 3 min read · (acesounderglass.com)

If influence functions are not approximating leave-one-out, how are they supposed to help?
Fabien Roger · Sep 22, 2023, 2:23 PM (UTC) · 66 points · 5 comments · 3 min read

Modeling p(doom) with TrojanGDP
K. Liam Smith · Sep 22, 2023, 2:19 PM (UTC) · −2 points · 2 comments · 13 min read