gasteigerjo

Karma: 362

Working on Alignment Science at Anthropic

AI Safety at the Frontier: Paper Highlights of May & June 2026

gasteigerjo

8 Jul 2026 17:19 UTC

16 points

0 comments

10 min read

LW link

AI Safety at the Frontier: Paper Highlights of April 2026

gasteigerjo

6 May 2026 13:58 UTC

19 points

1 comment

10 min read

LW link

AI Safety at the Frontier: Paper Highlights of February & March 2026

gasteigerjo

4 Apr 2026 14:58 UTC

8 points

0 comments

12 min read

LW link

AI Safety at the Frontier: Paper Highlights of January 2026

gasteigerjo

3 Feb 2026 18:56 UTC

22 points

0 comments

9 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights of December 2025

gasteigerjo

14 Jan 2026 14:29 UTC

16 points

0 comments

7 min read

LW link

(aisafetyfrontier.substack.com)

Towards training-time mitigations for alignment faking in RL

Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger and evhub

16 Dec 2025 21:01 UTC

39 points

1 comment

5 min read

LW link

(alignment.anthropic.com)

AI Safety at the Frontier: Paper Highlights of November 2025

gasteigerjo

2 Dec 2025 21:11 UTC

6 points

0 comments

8 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights of October 2025

gasteigerjo

5 Nov 2025 13:39 UTC

7 points

0 comments

8 min read

LW link

(aisafetyfrontier.substack.com)

Training fails to elicit subtle reasoning in current language models

mishajw, Fabien Roger, Hoagy, gasteigerjo, Joe Benton and Vlad Mikulik

9 Oct 2025 19:04 UTC

49 points

3 comments

4 min read

LW link

(alignment.anthropic.com)

AI Safety at the Frontier: Paper Highlights, September ’25

gasteigerjo

1 Oct 2025 16:24 UTC

11 points

0 comments

6 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, August ’25

gasteigerjo

2 Sep 2025 20:29 UTC

12 points

0 comments

7 min read

LW link

(open.substack.com)

AI Safety at the Frontier: Paper Highlights, July ’25

gasteigerjo

10 Aug 2025 12:49 UTC

7 points

0 comments

9 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, June ’25

gasteigerjo

7 Jul 2025 18:17 UTC

4 points

0 comments

7 min read

LW link

(open.substack.com)

AI Safety at the Frontier: Paper Highlights, May ’25

gasteigerjo

17 Jun 2025 17:16 UTC

6 points

0 comments

8 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, April ’25

gasteigerjo

6 May 2025 14:22 UTC

4 points

0 comments

7 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, March ’25

gasteigerjo

7 Apr 2025 20:17 UTC

9 points

0 comments

9 min read

LW link

(aisafetyfrontier.substack.com)

Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger

26 Mar 2025 19:13 UTC

44 points

0 comments

4 min read

LW link

(alignment.anthropic.com)

AI Safety at the Frontier: Paper Highlights, February ’25

gasteigerjo

3 Mar 2025 22:09 UTC

7 points

0 comments

7 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, January ’25

gasteigerjo

11 Feb 2025 16:14 UTC

7 points

0 comments

8 min read

LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, December ’24

gasteigerjo

11 Jan 2025 22:54 UTC

7 points

2 comments

7 min read

LW link

(aisafetyfrontier.substack.com)