
joshc

Karma: 1,631

Alignment faking CTFs: Apply to my MATS stream

joshc · Apr 4, 2025, 4:29 PM
60 points
0 comments · 4 min read · LW link

Training AI to do alignment research we don’t already know how to do

joshc · Feb 24, 2025, 7:19 PM
45 points
23 comments · 7 min read · LW link

How might we safely pass the buck to AI?

joshc · Feb 19, 2025, 5:48 PM
83 points
58 comments · 31 min read · LW link

How AI Takeover Might Happen in 2 Years

joshc · Feb 7, 2025, 5:10 PM
422 points
137 comments · 29 min read · LW link
(x.com)

Takeaways from sketching a control safety case

joshc · Jan 31, 2025, 4:43 AM
28 points
0 comments · 3 min read · LW link
(redwoodresearch.substack.com)

A sketch of an AI control safety case

Jan 30, 2025, 5:28 PM
57 points
0 comments · 5 min read · LW link

Planning for Extreme AI Risks

joshc · Jan 29, 2025, 6:33 PM
139 points
5 comments · 16 min read · LW link

When does capability elicitation bound risk?

joshc · Jan 22, 2025, 3:42 AM
25 points
0 comments · 17 min read · LW link
(redwoodresearch.substack.com)

Extending control evaluations to non-scheming threats

joshc · Jan 12, 2025, 1:42 AM
30 points
1 comment · 12 min read · LW link

New report: Safety Cases for AI

joshc · Mar 20, 2024, 4:45 PM
89 points
14 comments · 1 min read · LW link
(twitter.com)

List of strategies for mitigating deceptive alignment

joshc · Dec 2, 2023, 5:56 AM
38 points
2 comments · 6 min read · LW link

New paper shows truthfulness & instruction-following don’t generalize by default

joshc · Nov 19, 2023, 7:27 PM
60 points
0 comments · 4 min read · LW link

Testbed evals: evaluating AI safety even when it can’t be directly measured

joshc · Nov 15, 2023, 7:00 PM
71 points
2 comments · 4 min read · LW link

Red teaming: challenges and research directions

joshc · May 10, 2023, 1:40 AM
31 points
1 comment · 10 min read · LW link

Safety standards: a framework for AI regulation

joshc · May 1, 2023, 12:56 AM
19 points
0 comments · 8 min read · LW link

Are short timelines actually bad?

joshc · Feb 5, 2023, 9:21 PM
61 points
7 comments · 3 min read · LW link

[MLSN #7]: an example of an emergent internal optimizer

Jan 9, 2023, 7:39 PM
28 points
0 comments · 6 min read · LW link

Prizes for ML Safety Benchmark Ideas

joshc · Oct 28, 2022, 2:51 AM
36 points
5 comments · 1 min read · LW link

[Question] What is the best critique of AI existential risk arguments?

joshc · Aug 30, 2022, 2:18 AM
6 points
11 comments · 1 min read · LW link