Julian Stastny

Karma: 1,040

member of technical staff @ redwood research

Advice for making robust-to-training model organisms

28 May 2026 17:26 UTC
38 points
8 comments · 12 min read · LW link
(blog.redwoodresearch.org)

Why does off-model SFT degrade capabilities?

21 May 2026 0:35 UTC
42 points
9 comments · 6 min read · LW link

Incriminating misaligned AI models via distillation

15 May 2026 21:43 UTC
115 points
12 comments · 5 min read · LW link

Research Sabotage in ML Codebases

30 Apr 2026 0:26 UTC
62 points
3 comments · 6 min read · LW link

Sleeper Agent Backdoor Results Are Messy

28 Apr 2026 1:55 UTC
81 points
4 comments · 7 min read · LW link

An Empirical Study of Methods for SFTing Opaque Reasoning Models

24 Apr 2026 17:26 UTC
17 points
0 comments · 6 min read · LW link

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

20 Apr 2026 16:58 UTC
62 points
5 comments · 20 min read · LW link

Five approaches to evaluating training-based control measures

18 Apr 2026 1:07 UTC
22 points
0 comments · 6 min read · LW link

Logit ROCs: Monitor TPR is linear in FPR in logit space

15 Apr 2026 1:57 UTC
25 points
0 comments · 7 min read · LW link
(blog.redwoodresearch.org)

Model organisms researchers should check whether high LRs defeat their model organisms

10 Apr 2026 0:07 UTC
40 points
0 comments · 5 min read · LW link

How do we (more) safely defer to AIs?

12 Feb 2026 16:55 UTC
83 points
5 comments · 72 min read · LW link

Methodological considerations in making malign initializations for control research

24 Dec 2025 1:18 UTC
17 points
0 comments · 13 min read · LW link

Prospects for studying actual schemers

19 Sep 2025 14:11 UTC
40 points
2 comments · 58 min read · LW link

Research Areas in AI Control (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
25 points
0 comments · 18 min read · LW link
(alignmentproject.aisi.gov.uk)

Recent Redwood Research project proposals

14 Jul 2025 22:27 UTC
93 points
0 comments · 3 min read · LW link

Linkpost: Redwood Research reading list

Julian Stastny · 10 Jul 2025 18:39 UTC
50 points
0 comments · 1 min read · LW link
(redwoodresearch.substack.com)

What’s worse, spies or schemers?

9 Jul 2025 14:37 UTC
51 points
2 comments · 5 min read · LW link

Two proposed projects on abstract analogies for scheming

Julian Stastny · 4 Jul 2025 16:03 UTC
49 points
0 comments · 3 min read · LW link

Making deals with early schemers

20 Jun 2025 18:21 UTC
129 points
41 comments · 15 min read · LW link

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

8 May 2025 19:06 UTC
80 points
3 comments · 15 min read · LW link