Given that details about Claude Mythos Preview have made their way into Opus 4.7’s System Card, I’d like to ask @Dave Orr or other safetyists at Anthropic the following questions:
If you are familiar with Greenblatt’s article “Current AIs seem pretty misaligned to me”, how similar are Greenblatt’s impressions to Anthropic’s own impressions from using Opus 4.7 and/or Mythos Preview? Section 2.3.6 of Opus 4.7’s System Card, which is dedicated to Mythos, seems to describe behavior similar to what Greenblatt observed.
What weight-level mitigations could one apply, and which mitigations are already in place at Anthropic? For example, suppose that a model is trained both to classify outputs as cheating and to avoid producing outputs that the model itself would classify as cheating. If such an experiment has already been conducted, what was the result?