LessWrong Archive: December 2023, Page 1
[Question] Option Space Nomenclature · SilverFlame · Dec 8, 2023, 11:14 PM · 1 point · 0 comments · 1 min read · LW link
“Model UN Solutions” · Arjun Panickssery · Dec 8, 2023, 11:06 PM · 36 points · 5 comments · 1 min read · LW link · (open.substack.com)
Speed arguments against scheming (Section 4.4-4.7 of “Scheming AIs”) · Joe Carlsmith · Dec 8, 2023, 9:09 PM · 9 points · 0 comments · 15 min read · LW link
Modeling incentives at scale using LLMs · Bruno Marnette, pzahn and cmck · Dec 8, 2023, 6:46 PM · 7 points · 3 comments · 13 min read · LW link
Refusal mechanisms: initial experiments with Llama-2-7b-chat · Andy Arditi and Oscar Obeso · Dec 8, 2023, 5:08 PM · 82 points · 7 comments · 7 min read · LW link
Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study · Karolis Jucys · Dec 8, 2023, 1:18 PM · 16 points · 1 comment · 4 min read · LW link · (arxiv.org)
What I Would Do If I Were Working On AI Governance · johnswentworth · Dec 8, 2023, 6:43 AM · 110 points · 32 comments · 10 min read · LW link
Whither Prison Abolition? · MadHatter · Dec 8, 2023, 5:27 AM · −7 points · 0 comments · 16 min read · LW link · (bittertruths.substack.com)
Class consciousness for those against the class system · TekhneMakre · Dec 8, 2023, 1:02 AM · 11 points · 9 comments · 1 min read · LW link
Building selfless agents to avoid instrumental self-preservation. · blallo · Dec 7, 2023, 6:59 PM · 14 points · 2 comments · 6 min read · LW link
Does Chat-GPT display ‘Scope Insensitivity’? · callum · Dec 7, 2023, 6:58 PM · 11 points · 0 comments · 3 min read · LW link
LLM keys—A Proposal of a Solution to Prompt Injection Attacks · Peter Hroššo · Dec 7, 2023, 5:36 PM · 1 point · 2 comments · 1 min read · LW link
Meetup Tip: Heartbeat Messages · Screwtape · Dec 7, 2023, 5:18 PM · 69 points · 4 comments · 3 min read · LW link
[Valence series] 2. Valence & Normativity · Steven Byrnes · Dec 7, 2023, 4:43 PM · 88 points · 7 comments · 28 min read · LW link · 1 review
AISN #27: Defensive Accelerationism, A Retrospective On The OpenAI Board Saga, And A New AI Bill From Senators Thune And Klobuchar · Dan H, Corin Katzke and allison huang · Dec 7, 2023, 3:59 PM · 13 points · 0 comments · 6 min read · LW link · (newsletter.safe.ai)
AI #41: Bring in the Other Gemini · Zvi · Dec 7, 2023, 3:10 PM · 46 points · 16 comments · 52 min read · LW link · (thezvi.wordpress.com)
Simplicity arguments for scheming (Section 4.3 of “Scheming AIs”) · Joe Carlsmith · Dec 7, 2023, 3:05 PM · 10 points · 1 comment · 19 min read · LW link
Gemini 1.0 · Zvi · Dec 7, 2023, 2:40 PM · 50 points · 7 comments · 9 min read · LW link · (thezvi.wordpress.com)
Random Musings on Theory of Impact for Activation Vectors · Chris_Leong · Dec 7, 2023, 1:07 PM · 8 points · 0 comments · 1 min read · LW link
[Question] Is AlphaGo actually a consequentialist utility maximizer? · faul_sname · Dec 7, 2023, 12:41 PM · 36 points · 8 comments · 3 min read · LW link
(Report) Evaluating Taiwan’s Tactics to Safeguard its Semiconductor Assets Against a Chinese Invasion · Gauraventh · Dec 7, 2023, 11:50 AM · 14 points · 5 comments · 22 min read · LW link · (bristolaisafety.org)
Would AIs trapped in the Metaverse pine to enter the real world and would the ramifications cause trouble? · ProfessorFalken · Dec 7, 2023, 10:17 AM · −2 points · 1 comment · 1 min read · LW link
The GiveWiki’s Top Picks in AI Safety for the Giving Season of 2023 · Dawn Drescher · Dec 7, 2023, 9:23 AM · 4 points · 10 comments · LW link · (impactmarkets.substack.com)
Language Model Memorization, Copyright Law, and Conditional Pretraining Alignment · RogerDearnaley · Dec 7, 2023, 6:14 AM · 9 points · 0 comments · 11 min read · LW link
Reflective consistency, randomized decisions, and the dangers of unrealistic thought experiments · Radford Neal · Dec 7, 2023, 3:33 AM · 34 points · 25 comments · 6 min read · LW link
[Question] For fun: How long can you hold your breath? · exanova · Dec 6, 2023, 11:36 PM · 1 point · 7 comments · 1 min read · LW link
Mathematics As Physics · Nox ML · Dec 6, 2023, 10:27 PM · −2 points · 10 comments · 5 min read · LW link
The counting argument for scheming (Sections 4.1 and 4.2 of “Scheming AIs”) · Joe Carlsmith · Dec 6, 2023, 7:28 PM · 10 points · 0 comments · 10 min read · LW link
On Trust · johnswentworth · Dec 6, 2023, 7:19 PM · 42 points · 26 comments · 4 min read · LW link
Originality vs. Correctness · alkjash and habryka · Dec 6, 2023, 6:51 PM · 60 points · 17 comments · 25 min read · LW link
Proposal for improving the global online discourse through personalised comment ordering on all websites · Roman Leventov · Dec 6, 2023, 6:51 PM · 35 points · 21 comments · 6 min read · LW link
Google Gemini Announced · Jacob G-W · Dec 6, 2023, 4:14 PM · 54 points · 22 comments · 1 min read · LW link · (blog.google)
Based Beff Jezos and the Accelerationists · Zvi · Dec 6, 2023, 4:00 PM · 90 points · 29 comments · 12 min read · LW link · (thezvi.wordpress.com)
Bucket Brigade: Likely End-of-Life · jefftk · Dec 6, 2023, 3:30 PM · 16 points · 1 comment · 1 min read · LW link · (www.jefftk.com)
Why Yudkowsky is wrong about “covalently bonded equivalents of biology” · titotal · Dec 6, 2023, 2:09 PM · 44 points · 41 comments · LW link · (open.substack.com)
Metaculus Launches Chinese AI Chips Tournament, Supporting Institute for AI Policy and Strategy Research · ChristianWilliams · Dec 6, 2023, 11:26 AM · 10 points · 1 comment · LW link · (www.metaculus.com)
Minimal Viable Paradise: How do we get The Good Future(TM)? · Nathan Young · Dec 6, 2023, 9:24 AM · 9 points · 0 comments · 7 min read · LW link
Anthropical Paradoxes are Paradoxes of Probability Theory · Ape in the coat · Dec 6, 2023, 8:16 AM · 55 points · 18 comments · 5 min read · LW link
Digital humans vs merge with AI? Same or different? · Nathan Helm-Burger and mishka · Dec 6, 2023, 4:56 AM · 21 points · 11 comments · 7 min read · LW link
EA Infrastructure Fund’s Plan to Focus on Principles-First EA · Linch · Dec 6, 2023, 3:24 AM · 27 points · 0 comments · LW link
**In defence of Helen Toner, Adam D’Angelo, and Tasha McCauley** · mrtreasure · Dec 6, 2023, 2:02 AM · 25 points · 3 comments · 9 min read · LW link · (pastebin.com)
Some quick thoughts on “AI is easy to control” · Mikhail Samin · Dec 6, 2023, 12:58 AM · 15 points · 10 comments · 7 min read · LW link
ACX Corvallis, OR · kenakofer · Dec 6, 2023, 12:23 AM · 1 point · 0 comments · 1 min read · LW link
Multinational corporations as optimizers: a case for reaching across the aisle · sudo-nym · Dec 6, 2023, 12:14 AM · 9 points · 10 comments · 1 min read · LW link
[Question] How do you feel about LessWrong these days? [Open feedback thread] · Bird Concept · Dec 5, 2023, 8:54 PM · 108 points · 285 comments · 1 min read · LW link
Critique-a-Thon of AI Alignment Plans · Iknownothing · Dec 5, 2023, 8:50 PM · 12 points · 3 comments · 1 min read · LW link
Arguments for/against scheming that focus on the path SGD takes (Section 3 of “Scheming AIs”) · Joe Carlsmith · Dec 5, 2023, 6:48 PM · 10 points · 0 comments · 23 min read · LW link
In defence of Helen Toner, Adam D’Angelo, and Tasha McCauley (OpenAI post) · mrtreasure · Dec 5, 2023, 6:40 PM · 6 points · 2 comments · 1 min read · LW link · (pastebin.com)
Studying The Alien Mind · Quentin FEUILLADE--MONTIXI and NicholasKees · Dec 5, 2023, 5:27 PM · 80 points · 10 comments · 15 min read · LW link
Deep Forgetting & Unlearning for Safely-Scoped LLMs · scasper · Dec 5, 2023, 4:48 PM · 126 points · 30 comments · 13 min read · LW link