Joe Carlsmith

Karma: 5,158

Senior research analyst at Open Philanthropy. Doctorate in philosophy from the University of Oxford. Opinions my own.

The goal-guarding hypothesis (Section 2.3.1.1 of “Scheming AIs”)

Joe Carlsmith · Dec 2, 2023, 3:20 PM
8 points
1 comment · 15 min read · LW link

How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of “Scheming AIs”)

Joe Carlsmith · Dec 1, 2023, 2:51 PM
10 points
1 comment · 7 min read · LW link

Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of “Scheming AIs”)

Joe Carlsmith · Nov 30, 2023, 4:43 PM
8 points
0 comments · 6 min read · LW link

“Clean” vs. “messy” goal-directedness (Section 2.2.3 of “Scheming AIs”)

Joe Carlsmith · Nov 29, 2023, 4:32 PM
29 points
1 comment · 11 min read · LW link

Two sources of beyond-episode goals (Section 2.2.2 of “Scheming AIs”)

Joe Carlsmith · Nov 28, 2023, 1:49 PM
11 points
1 comment · 15 min read · LW link

Two concepts of an “episode” (Section 2.2.1 of “Scheming AIs”)

Joe Carlsmith · Nov 27, 2023, 6:01 PM
19 points
1 comment · 13 min read · LW link

Situational awareness (Section 2.1 of “Scheming AIs”)

Joe Carlsmith · Nov 26, 2023, 11:00 PM
10 points
5 comments · 8 min read · LW link

On “slack” in training (Section 1.5 of “Scheming AIs”)

Joe Carlsmith · Nov 25, 2023, 5:51 PM
1 point
0 comments · 5 min read · LW link

Why focus on schemers in particular (Sections 1.3 and 1.4 of “Scheming AIs”)

Joe Carlsmith · Nov 24, 2023, 7:18 PM
8 points
0 comments · 22 min read · LW link

A taxonomy of non-schemer models (Section 1.2 of “Scheming AIs”)

Joe Carlsmith · Nov 22, 2023, 3:24 PM
13 points
0 comments · 13 min read · LW link

Varieties of fake alignment (Section 1.1 of “Scheming AIs”)

Joe Carlsmith · Nov 21, 2023, 3:00 PM
15 points
0 comments · 12 min read · LW link

New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Joe Carlsmith · Nov 15, 2023, 5:16 PM
81 points
28 comments · 30 min read · LW link · 1 review

Superforecasting the premises in “Is power-seeking AI an existential risk?”

Joe Carlsmith · Oct 18, 2023, 8:23 PM
31 points
3 comments · 5 min read · LW link

In memory of Louise Glück

Joe Carlsmith · Oct 15, 2023, 2:59 AM
41 points
1 comment · 8 min read · LW link

The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · Jul 31, 2023, 11:06 PM
61 points
14 comments · 9 min read · LW link

Predictable updating about AI risk

Joe Carlsmith · May 8, 2023, 9:53 PM
294 points
25 comments · 36 min read · LW link · 1 review

[Linkpost] Shorter version of report on existential risk from power-seeking AI

Joe Carlsmith · Mar 22, 2023, 6:09 PM
7 points
0 comments · 1 min read · LW link

A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation)

Joe Carlsmith · Feb 21, 2023, 5:26 PM
38 points
16 comments · 1 min read · LW link

Seeing more whole

Joe Carlsmith · Feb 17, 2023, 5:12 AM
31 points
1 comment · 26 min read · LW link

Why should ethical anti-realists do ethics?

Joe Carlsmith · Feb 16, 2023, 4:27 PM
38 points
7 comments · 27 min read · LW link