Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended.
This makes model organisms of steg really hard to make IMO.
OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.
Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.
Feedback from various people: the “opaqueness” criterion is really important. Simple schemes like jargon, speaking in a different language, or redefining words via OOCR aren’t sufficiently “opaque” because a human overseer could still understand the CoT if effort was expended.
This makes model organisms of steg really hard to make IMO.
OTOH I think we don’t need “strict opaqueness” for CoT unfaithfulness to be concerning.
Maybe we need to borrow an idea from AI control and assume an oversight budget; under this model, even if a model doesn’t engage in full-blown steganography, increased amounts of obfuscation (like using more jargon) might still make it impractical to verify the full CoT.