joanv
Good catch. Note also that Mythos was made available for internal (agentic) use on Feb 24th. Conditional on 4.1.4 in the system card (Alignment assessment before internal deployment), it means that they had a hunch for the capabilities of the model (see “Given the very significant capabilities progress that we observed during training, [...]”), assessed its alignment in this 24hrs period and concluded it was ok. I have many question marks: at the bare minimum there’s some inconsistency.
joanv’s Shortform
From Claude Mythos‘ system card:
“Following a successful alignment review, the first early version of Claude Mythos Preview was made available for internal use on February 24.”
Anthropic’s RSP was also updated on Feb 24th.
i’d appreciate clarity on what looks like a funny coincidence.
[Paper] When can we trust untrusted monitoring? A safety case sketch across collusion strategies
>Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.
I would argue that instead human priors serve as a mechanism to help the search process, as it’s being shown with cold-started reasoning models: they bake-in some reasoning traces that the model can then learn to exploit via RL. While this is not very bitter lesson-esque, the solution space is so large that it’d probably be quite difficult to do so without the cold start phase (although R1-zero kind of hints at this being possible). Maybe we have not yet thrown as much compute at the problem to do this search from scratch effectively.
What is the threat model you are probing for here?
The prompting approach suffers from LLMs’ shallowness in strategic thinking; I would not be surprised if, prompting a language model several times to come up with an encoding scheme would result in a highly concentrated distribution.
> A saving grace may be that ‘detecting steg is happening at all’ seems easier than ’recovering the plaintext message’
While i agree in principle, as in all red/blue team interactions, you may not want your blue team to be trigger happy; you still want to be able to extract some usefulness out of your scheming models, thus making this hope somewhat less robust. This would be a cool research question nonetheless.
> Both of these [steganography inducing methods] involve finetuning the model
Note that in Hidden in Plain Text, we also work with in-context RL, which is not explicitly finetuning the model, but just giving rewards to rollouts in context.
Moreover, in this paradigm, forms of hidden reasoning seem likely to emerge: in multi-step reasoning, for example, the model might find it efficient to compress backtracking or common reasoning cues into cryptic tokens (e.g., “Hmmm”) as a kind of shorthand to encode arbitrarily dense or unclear information. This is especially true under financial pressures to compress/shorten the Chains-of-Thought, thus allowing models to perform potentially long serial reasoning outside of human/AI oversight.
Some mixed feelings about Anthropic’s Auto-Alignment Researcher post here:
TLDR: if we really want to turn alignment in a “number go up game”, we better make sure evals/envs we use are water-tight (which tbc may be easier than “solving” alignment), and I claim that this eval/env may not be super robust.
https://x.com/joanvelja/status/2044231292104126911?s=46