In their Risk Report, they admit that they don’t do fine-tuning as part of their evals. So if the model were misaligned, it could simply sandbag the evaluations.
Wait, isn’t not fine-tuning on evals good? Sandbagging is a valid concern, but I feel like fine-tuning would be worse.
You don’t fine-tune on the eval as part of the production model release; you just do a small amount of fine-tuning solely for the purposes of the eval (i.e., on a branch off the production model lineage) to overcome any sandbagging that might be occurring.
Suppose we create two copies of Mythos and fine-tune one of them on evals in order to elicit capabilities. Then the OTHER copy doesn’t learn anything about the evals, while we still learn about the fine-tuned copy’s dangerous abilities.
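To make the protocol concrete, here’s a minimal sketch of the branch-and-evaluate flow. This is an illustration, not any lab’s actual pipeline: `fine_tune`, `eval_suite`, and `elicitation_data` are hypothetical placeholders, and the model is assumed to be an ordinary Python object (e.g., PyTorch-style) that can be deep-copied.

```python
from copy import deepcopy

def elicitation_eval(production_model, elicitation_data, eval_suite, fine_tune):
    """Sketch: fine-tune a throwaway branch for evals, never the production model.

    - production_model: the model being released (never trained on eval data)
    - elicitation_data: small fine-tuning set used only to defeat sandbagging
    - eval_suite: dict mapping eval names to callables that score a model
    - fine_tune: hypothetical training routine that returns the updated model
    """
    branch = deepcopy(production_model)           # branch off the production lineage
    branch = fine_tune(branch, elicitation_data)  # small fine-tune, eval branch only
    scores = {name: run(branch) for name, run in eval_suite.items()}
    del branch                                    # the branch is discarded, never shipped
    return scores                                 # production_model remains untouched
```

The point of the design is that any capabilities the branch reveals are evidence about what the production model can do when it isn’t sandbagging, while the production lineage itself never sees the eval data.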