gasteigerjo
Karma: 349
Working on Alignment Science at Anthropic
AI Safety at the Frontier: Paper Highlights of April 2026
gasteigerjo · 6 May 2026 13:58 UTC · 18 points · 1 comment · 10 min read · LW link
AI Safety at the Frontier: Paper Highlights of February & March 2026
gasteigerjo · 4 Apr 2026 14:58 UTC · 8 points · 0 comments · 12 min read · LW link
AI Safety at the Frontier: Paper Highlights of January 2026
gasteigerjo · 3 Feb 2026 18:56 UTC · 22 points · 0 comments · 9 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights of December 2025
gasteigerjo · 14 Jan 2026 14:29 UTC · 16 points · 0 comments · 7 min read · LW link (aisafetyfrontier.substack.com)
Towards training-time mitigations for alignment faking in RL
Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger and evhub · 16 Dec 2025 21:01 UTC · 39 points · 1 comment · 5 min read · LW link (alignment.anthropic.com)
AI Safety at the Frontier: Paper Highlights of November 2025
gasteigerjo · 2 Dec 2025 21:11 UTC · 6 points · 0 comments · 8 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights of October 2025
gasteigerjo · 5 Nov 2025 13:39 UTC · 7 points · 0 comments · 8 min read · LW link (aisafetyfrontier.substack.com)
Training fails to elicit subtle reasoning in current language models
mishajw, Fabien Roger, Hoagy, gasteigerjo, Joe Benton and Vlad Mikulik · 9 Oct 2025 19:04 UTC · 49 points · 3 comments · 4 min read · LW link (alignment.anthropic.com)
AI Safety at the Frontier: Paper Highlights, September ’25
gasteigerjo · 1 Oct 2025 16:24 UTC · 11 points · 0 comments · 6 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights, August ’25
gasteigerjo · 2 Sep 2025 20:29 UTC · 12 points · 0 comments · 7 min read · LW link (open.substack.com)
AI Safety at the Frontier: Paper Highlights, July ’25
gasteigerjo · 10 Aug 2025 12:49 UTC · 7 points · 0 comments · 9 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights, June ’25
gasteigerjo · 7 Jul 2025 18:17 UTC · 4 points · 0 comments · 7 min read · LW link (open.substack.com)
AI Safety at the Frontier: Paper Highlights, May ’25
gasteigerjo · 17 Jun 2025 17:16 UTC · 6 points · 0 comments · 8 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights, April ’25
gasteigerjo · 6 May 2025 14:22 UTC · 4 points · 0 comments · 7 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights, March ’25
gasteigerjo · 7 Apr 2025 20:17 UTC · 9 points · 0 comments · 9 min read · LW link (aisafetyfrontier.substack.com)
Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger · 26 Mar 2025 19:13 UTC · 44 points · 0 comments · 4 min read · LW link (alignment.anthropic.com)
AI Safety at the Frontier: Paper Highlights, February ’25
gasteigerjo · 3 Mar 2025 22:09 UTC · 7 points · 0 comments · 7 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights, January ’25
gasteigerjo · 11 Feb 2025 16:14 UTC · 7 points · 0 comments · 8 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights, December ’24
gasteigerjo · 11 Jan 2025 22:54 UTC · 7 points · 2 comments · 7 min read · LW link (aisafetyfrontier.substack.com)
AI Safety at the Frontier: Paper Highlights, November ’24
gasteigerjo · 7 Dec 2024 19:15 UTC · 7 points · 0 comments · 8 min read · LW link (aisafetyfrontier.substack.com)