Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 19 Jan 2025 2:55 UTC
Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque”, because a human overseer could still understand the CoT given enough effort.
This makes it really hard to build model organisms of steg IMO.
OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.
Maybe we need to borrow an idea from AI control and assume an oversight budget; under this framing, even if a model doesn’t engage in full-blown steganography, more obfuscation (like heavier use of jargon) might still make it impractical to verify the full CoT.
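To make the budget framing concrete, here is a toy sketch. Everything in it is an assumption for illustration: the function name `verifiable_fraction`, the linear cost model, and the idea that obfuscation acts as a simple multiplier on per-step verification effort. It is not taken from any AI control paper or library.

```python
def verifiable_fraction(
    budget: float,
    cot_steps: int,
    base_cost: float,
    obfuscation: float,
) -> float:
    """Fraction of a CoT an overseer can verify under a fixed budget.

    All parameters are hypothetical:
      budget      -- total overseer effort available (arbitrary units)
      cot_steps   -- number of reasoning steps in the CoT
      base_cost   -- effort to verify one plain-language step
      obfuscation -- multiplier >= 1 on per-step effort (1 = fully legible)
    """
    if obfuscation < 1:
        raise ValueError("obfuscation multiplier must be >= 1")
    if cot_steps <= 0:
        return 1.0  # nothing to verify
    # Assumed linear cost model: every step costs base_cost * obfuscation.
    total_cost = cot_steps * base_cost * obfuscation
    return min(1.0, budget / total_cost)


# Same budget, same CoT length, different legibility.
plain = verifiable_fraction(budget=100, cot_steps=100, base_cost=1.0, obfuscation=1.0)
jargon = verifiable_fraction(budget=100, cot_steps=100, base_cost=1.0, obfuscation=4.0)
print(plain, jargon)
```

Under these assumed numbers, the overseer can check the whole plain CoT but only a quarter of the jargon-heavy one, even though nothing in the jargon version is strictly undecodable.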