joshc

Karma: 1,625

Alignment faking CTFs: Apply to my MATS stream

joshc · 4 Apr 2025 16:29 UTC
60 points
0 comments · 4 min read · LW link

Training AI to do alignment research we don’t already know how to do

joshc · 24 Feb 2025 19:19 UTC
45 points
23 comments · 7 min read · LW link

How might we safely pass the buck to AI?

joshc · 19 Feb 2025 17:48 UTC
83 points
58 comments · 31 min read · LW link

How AI Takeover Might Happen in 2 Years

joshc · 7 Feb 2025 17:10 UTC
416 points
137 comments · 29 min read · LW link
(x.com)

Takeaways from sketching a control safety case

joshc · 31 Jan 2025 4:43 UTC
28 points
0 comments · 3 min read · LW link
(redwoodresearch.substack.com)

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
57 points
0 comments · 5 min read · LW link

Planning for Extreme AI Risks

joshc · 29 Jan 2025 18:33 UTC
139 points
5 comments · 16 min read · LW link

When does capability elicitation bound risk?

joshc · 22 Jan 2025 3:42 UTC
25 points
0 comments · 17 min read · LW link
(redwoodresearch.substack.com)

Extending control evaluations to non-scheming threats

joshc · 12 Jan 2025 1:42 UTC
30 points
1 comment · 12 min read · LW link

New report: Safety Cases for AI

joshc · 20 Mar 2024 16:45 UTC
89 points
14 comments · 1 min read · LW link
(twitter.com)

List of strategies for mitigating deceptive alignment

joshc · 2 Dec 2023 5:56 UTC
38 points
2 comments · 6 min read · LW link

New paper shows truthfulness & instruction-following don’t generalize by default

joshc · 19 Nov 2023 19:27 UTC
60 points
0 comments · 4 min read · LW link

Testbed evals: evaluating AI safety even when it can’t be directly measured

joshc · 15 Nov 2023 19:00 UTC
71 points
2 comments · 4 min read · LW link

Red teaming: challenges and research directions

joshc · 10 May 2023 1:40 UTC
31 points
1 comment · 10 min read · LW link

Safety standards: a framework for AI regulation

joshc · 1 May 2023 0:56 UTC
19 points
0 comments · 8 min read · LW link

Are short timelines actually bad?

joshc · 5 Feb 2023 21:21 UTC
61 points
7 comments · 3 min read · LW link

[MLSN #7]: an example of an emergent internal optimizer

9 Jan 2023 19:39 UTC
28 points
0 comments · 6 min read · LW link

Prizes for ML Safety Benchmark Ideas

joshc · 28 Oct 2022 2:51 UTC
36 points
5 comments · 1 min read · LW link

[Question] What is the best critique of AI existential risk arguments?

joshc · 30 Aug 2022 2:18 UTC
6 points
11 comments · 1 min read · LW link