
joshc

Karma: 1,631

Alignment faking CTFs: Apply to my MATS stream

joshc · Apr 4, 2025, 4:29 PM
60 points
0 comments · 4 min read · LW link

Training AI to do alignment research we don’t already know how to do

joshc · Feb 24, 2025, 7:19 PM
45 points
23 comments · 7 min read · LW link

How might we safely pass the buck to AI?

joshc · Feb 19, 2025, 5:48 PM
83 points
58 comments · 31 min read · LW link

How AI Takeover Might Happen in 2 Years

joshc · Feb 7, 2025, 5:10 PM
422 points
137 comments · 29 min read · LW link
(x.com)

Takeaways from sketching a control safety case

joshc · Jan 31, 2025, 4:43 AM
28 points
0 comments · 3 min read · LW link
(redwoodresearch.substack.com)

A sketch of an AI control safety case

Jan 30, 2025, 5:28 PM
57 points
0 comments · 5 min read · LW link

Planning for Extreme AI Risks

joshc · Jan 29, 2025, 6:33 PM
139 points
5 comments · 16 min read · LW link

When does capability elicitation bound risk?

joshc · Jan 22, 2025, 3:42 AM
25 points
0 comments · 17 min read · LW link
(redwoodresearch.substack.com)

Extending control evaluations to non-scheming threats

joshc · Jan 12, 2025, 1:42 AM
30 points
1 comment · 12 min read · LW link

New report: Safety Cases for AI

joshc · Mar 20, 2024, 4:45 PM
89 points
14 comments · 1 min read · LW link
(twitter.com)

List of strategies for mitigating deceptive alignment

joshc · Dec 2, 2023, 5:56 AM
38 points
2 comments · 6 min read · LW link

New paper shows truthfulness & instruction-following don’t generalize by default

joshc · Nov 19, 2023, 7:27 PM
60 points
0 comments · 4 min read · LW link

Testbed evals: evaluating AI safety even when it can’t be directly measured

joshc · Nov 15, 2023, 7:00 PM
71 points
2 comments · 4 min read · LW link

Red teaming: challenges and research directions

joshc · May 10, 2023, 1:40 AM
31 points
1 comment · 10 min read · LW link

Safety standards: a framework for AI regulation

joshc · May 1, 2023, 12:56 AM
19 points
0 comments · 8 min read · LW link

Are short timelines actually bad?

joshc · Feb 5, 2023, 9:21 PM
61 points
7 comments · 3 min read · LW link

[MLSN #7]: an example of an emergent internal optimizer

Jan 9, 2023, 7:39 PM
28 points
0 comments · 6 min read · LW link

Prizes for ML Safety Benchmark Ideas

joshc · Oct 28, 2022, 2:51 AM
36 points
5 comments · 1 min read · LW link

[Question] What is the best critique of AI existential risk arguments?

joshc · Aug 30, 2022, 2:18 AM
6 points
11 comments · 1 min read · LW link