
Jozdien

Karma: 2,486

Realistic Reward Hacking Induces Different and Deeper Misalignment

Jozdien, 9 Oct 2025 18:45 UTC
122 points
2 comments, 23 min read, LW link

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
152 points
35 comments, 2 min read, LW link