Experimenting with cross-posting reviews from https://www.papertrailshq.com/
[Review] Replication of the Auditing Game Model Organism
https://alignment.anthropic.com/2025/auditing-mo-replication/
★★★★☆ (4.0/5)
Valuable for the community, and it makes new scientific contributions too: in particular, showing that adversarial training against profile attacks (and other inputs) generalizes to robustness against user-persona and “base model” sampling.
This generalization is interesting, though I’m curious whether profile robustness could be induced without directly training against it (e.g., by training on synthetic documents containing information about profile attacks).
https://www.papertrailshq.com/papers/cmjw01xv7000bjv04o14vzagx#reviews