
Julian Stastny

Karma: 890

member of technical staff @ redwood research

Research Sabotage in ML Codebases

30 Apr 2026 0:26 UTC
62 points
2 comments, 6 min read, LW link

Sleeper Agent Backdoor Results Are Messy

28 Apr 2026 1:55 UTC
79 points
4 comments, 7 min read, LW link

An Empirical Study of Methods for SFTing Opaque Reasoning Models

24 Apr 2026 17:26 UTC
17 points
0 comments, 6 min read, LW link

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

20 Apr 2026 16:58 UTC
61 points
5 comments, 20 min read, LW link

Five approaches to evaluating training-based control measures

18 Apr 2026 1:07 UTC
19 points
0 comments, 6 min read, LW link

Logit ROCs: Monitor TPR is linear in FPR in logit space

15 Apr 2026 1:57 UTC
25 points
0 comments, 7 min read, LW link
(blog.redwoodresearch.org)

Model organisms researchers should check whether high LRs defeat their model organisms

10 Apr 2026 0:07 UTC
40 points
0 comments, 5 min read, LW link

How do we (more) safely defer to AIs?

12 Feb 2026 16:55 UTC
82 points
5 comments, 72 min read, LW link

Methodological considerations in making malign initializations for control research

24 Dec 2025 1:18 UTC
16 points
0 comments, 13 min read, LW link

Prospects for studying actual schemers

19 Sep 2025 14:11 UTC
40 points
2 comments, 58 min read, LW link

Research Areas in AI Control (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
25 points
0 comments, 18 min read, LW link
(alignmentproject.aisi.gov.uk)

Recent Redwood Research project proposals

14 Jul 2025 22:27 UTC
93 points
0 comments, 3 min read, LW link

Linkpost: Redwood Research reading list

10 Jul 2025 18:39 UTC
50 points
0 comments, 1 min read, LW link
(redwoodresearch.substack.com)

What’s worse, spies or schemers?

9 Jul 2025 14:37 UTC
51 points
2 comments, 5 min read, LW link

Two proposed projects on abstract analogies for scheming

4 Jul 2025 16:03 UTC
49 points
0 comments, 3 min read, LW link

Making deals with early schemers

20 Jun 2025 18:21 UTC
129 points
41 comments, 15 min read, LW link

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

8 May 2025 19:06 UTC
80 points
3 comments, 15 min read, LW link

7+ tractable directions in AI control

28 Apr 2025 17:12 UTC
93 points
1 comment, 13 min read, LW link

Disentangling four motivations for acting in accordance with UDT

5 Nov 2023 21:26 UTC
46 points
4 comments, 7 min read, LW link