Jozdien

Karma: 2,405

Realistic Reward Hacking Induces Different and Deeper Misalignment

Jozdien, 9 Oct 2025 18:45 UTC
99 points
2 comments, 23 min read, LW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
141 points
26 comments, 2 min read, LW link