Julian Stastny

Karma: 890

member of technical staff @ redwood research

Research Sabotage in ML Codebases

30 Apr 2026 0:26 UTC
62 points
2 comments, 6 min read

Sleeper Agent Backdoor Results Are Messy

28 Apr 2026 1:55 UTC
79 points
4 comments, 7 min read

An Empirical Study of Methods for SFTing Opaque Reasoning Models

24 Apr 2026 17:26 UTC
17 points
0 comments, 6 min read

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

20 Apr 2026 16:58 UTC
61 points
5 comments, 20 min read

Five approaches to evaluating training-based control measures

18 Apr 2026 1:07 UTC
19 points
0 comments, 6 min read

Logit ROCs: Monitor TPR is linear in FPR in logit space

15 Apr 2026 1:57 UTC
25 points
0 comments, 7 min read
(blog.redwoodresearch.org)

Model organisms researchers should check whether high LRs defeat their model organisms

10 Apr 2026 0:07 UTC
40 points
0 comments, 5 min read

How do we (more) safely defer to AIs?

12 Feb 2026 16:55 UTC
82 points
5 comments, 72 min read

Methodological considerations in making malign initializations for control research

24 Dec 2025 1:18 UTC
16 points
0 comments, 13 min read

Prospects for studying actual schemers

19 Sep 2025 14:11 UTC
40 points
2 comments, 58 min read

Research Areas in AI Control (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
25 points
0 comments, 18 min read
(alignmentproject.aisi.gov.uk)