Some mixed feelings about Anthropic’s Auto-Alignment Researcher post here:
TLDR: if we really want to turn alignment into a “number go up” game, we'd better make sure the evals/envs we use are watertight (which, tbc, may be easier than “solving” alignment), and I claim that this eval/env may not be super robust.
https://x.com/joanvelja/status/2044231292104126911?s=46