joshc

Karma: 1,744

If anyone builds it, everyone will plausibly be fine

joshc · 18 Sep 2025 20:03 UTC
29 points
24 comments · 7 min read · LW link

Recent Redwood Research project proposals

14 Jul 2025 22:27 UTC
91 points
0 comments · 3 min read · LW link

Alignment faking CTFs: Apply to my MATS stream

joshc · 4 Apr 2025 16:29 UTC
61 points
0 comments · 4 min read · LW link

Training AI to do alignment research we don’t already know how to do

joshc · 24 Feb 2025 19:19 UTC
45 points
24 comments · 7 min read · LW link

How might we safely pass the buck to AI?

joshc · 19 Feb 2025 17:48 UTC
83 points
58 comments · 31 min read · LW link

How AI Takeover Might Happen in 2 Years

joshc · 7 Feb 2025 17:10 UTC
432 points
140 comments · 29 min read · LW link
(x.com)

Takeaways from sketching a control safety case

joshc · 31 Jan 2025 4:43 UTC
28 points
0 comments · 3 min read · LW link
(redwoodresearch.substack.com)

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
57 points
0 comments · 5 min read · LW link

Planning for Extreme AI Risks

joshc · 29 Jan 2025 18:33 UTC
143 points
5 comments · 16 min read · LW link

When does capability elicitation bound risk?

joshc · 22 Jan 2025 3:42 UTC
25 points
0 comments · 17 min read · LW link
(redwoodresearch.substack.com)

Extending control evaluations to non-scheming threats

joshc · 12 Jan 2025 1:42 UTC
30 points
1 comment · 12 min read · LW link

New report: Safety Cases for AI

joshc · 20 Mar 2024 16:45 UTC
91 points
14 comments · 1 min read · LW link
(twitter.com)

List of strategies for mitigating deceptive alignment

joshc · 2 Dec 2023 5:56 UTC
40 points
2 comments · 6 min read · LW link

New paper shows truthfulness & instruction-following don’t generalize by default

joshc · 19 Nov 2023 19:27 UTC
60 points
0 comments · 4 min read · LW link

Testbed evals: evaluating AI safety even when it can’t be directly measured

joshc · 15 Nov 2023 19:00 UTC
72 points
2 comments · 4 min read · LW link

Red teaming: challenges and research directions

joshc · 10 May 2023 1:40 UTC
31 points
1 comment · 10 min read · LW link

Safety standards: a framework for AI regulation

joshc · 1 May 2023 0:56 UTC
19 points
0 comments · 8 min read · LW link

Are short timelines actually bad?

joshc · 5 Feb 2023 21:21 UTC
61 points
7 comments · 3 min read · LW link

[MLSN #7]: an example of an emergent internal optimizer

9 Jan 2023 19:39 UTC
28 points
0 comments · 6 min read · LW link

Prizes for ML Safety Benchmark Ideas

joshc · 28 Oct 2022 2:51 UTC
36 points
5 comments · 1 min read · LW link