
Sam Marks

Karma: 4,470

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
144 points
30 comments · 2 min read · LW link

Petri: An open-source auditing tool to accelerate AI safety research

Sam Marks · 7 Oct 2025 20:39 UTC
77 points
0 comments · 1 min read · LW link
(alignment.anthropic.com)

Eliciting secret knowledge from language models

2 Oct 2025 20:57 UTC
68 points
3 comments · 2 min read · LW link
(arxiv.org)

Discovering Backdoor Triggers

19 Aug 2025 6:24 UTC
57 points
4 comments · 13 min read · LW link