gasteigerjo

Karma: 349

Working on Alignment Science at Anthropic

AI Safety at the Frontier: Paper Highlights of April 2026

gasteigerjo 6 May 2026 13:58 UTC

18 points

1 comment · 10 min read · LW link

AI Safety at the Frontier: Paper Highlights of February & March 2026

gasteigerjo 4 Apr 2026 14:58 UTC

8 points

0 comments · 12 min read · LW link

AI Safety at the Frontier: Paper Highlights of January 2026

gasteigerjo 3 Feb 2026 18:56 UTC

22 points

0 comments · 9 min read · LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights of December 2025

gasteigerjo 14 Jan 2026 14:29 UTC

16 points

0 comments · 7 min read · LW link

(aisafetyfrontier.substack.com)

Towards training-time mitigations for alignment faking in RL

Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger and evhub

16 Dec 2025 21:01 UTC

39 points

1 comment · 5 min read · LW link

(alignment.anthropic.com)

AI Safety at the Frontier: Paper Highlights of November 2025

gasteigerjo 2 Dec 2025 21:11 UTC

6 points

0 comments · 8 min read · LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights of October 2025

gasteigerjo 5 Nov 2025 13:39 UTC

7 points

0 comments · 8 min read · LW link

(aisafetyfrontier.substack.com)

Training fails to elicit subtle reasoning in current language models

mishajw, Fabien Roger, Hoagy, gasteigerjo, Joe Benton and Vlad Mikulik

9 Oct 2025 19:04 UTC

49 points

3 comments · 4 min read · LW link

(alignment.anthropic.com)

AI Safety at the Frontier: Paper Highlights, September ’25

gasteigerjo 1 Oct 2025 16:24 UTC

11 points

0 comments · 6 min read · LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, August ’25

gasteigerjo 2 Sep 2025 20:29 UTC

12 points

0 comments · 7 min read · LW link

(open.substack.com)

AI Safety at the Frontier: Paper Highlights, July ’25

gasteigerjo 10 Aug 2025 12:49 UTC

7 points

0 comments · 9 min read · LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, June ’25

gasteigerjo 7 Jul 2025 18:17 UTC

4 points

0 comments · 7 min read · LW link

(open.substack.com)

AI Safety at the Frontier: Paper Highlights, May ’25

gasteigerjo 17 Jun 2025 17:16 UTC

6 points

0 comments · 8 min read · LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, April ’25

gasteigerjo 6 May 2025 14:22 UTC

4 points

0 comments · 7 min read · LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, March ’25

gasteigerjo 7 Apr 2025 20:17 UTC

9 points

0 comments · 9 min read · LW link

(aisafetyfrontier.substack.com)

Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger

26 Mar 2025 19:13 UTC

44 points

0 comments · 4 min read · LW link

(alignment.anthropic.com)

AI Safety at the Frontier: Paper Highlights, February ’25

gasteigerjo 3 Mar 2025 22:09 UTC

7 points

0 comments · 7 min read · LW link

(aisafetyfrontier.substack.com)

AI Safety at the Frontier: Paper Highlights, January ’25

gasteigerjo 11 Feb 2025 16:14 UTC

7 points

0 comments · 8 min read · LW link

(aisafetyfrontier.substack.com)

gasteigerjo 11 Feb 2025 16:11 UTC
1 point
0
in reply to: RandyOrion’s comment on: AI Safety at the Frontier: Paper Highlights, December ’24
This is normal. Workshops are non-archival and conferences only require that the work hasn’t been submitted to any archival venues.
[edit to extend]: Researchers will often submit their work to a conference and one or even multiple workshops in parallel. Workshops are great for reaching a more targeted audience and getting focused discussion. It’s also a way to get more people to see your paper.

gasteigerjo 12 Jan 2025 17:56 UTC
1 point
0
in reply to: habryka’s comment on: (The) Lightcone is nothing without its people: LW + Lighthaven’s big fundraiser
I’ll also give at least £5k if tax deductibility is set up in the UK.