As originally conceived, this is sort of like a “dangerous capability” eval for steg. The argument is that if a model can do steg in this very toy setting, where we’ve significantly nudged the model, it might do steg in more realistic scenarios.
There is also a claim here of the form “language models can decode things from their own encodings that other language models cannot, due to having access to privileged internal information”.
I didn’t make these points especially clear in the slide deck, so thanks for the feedback!
Agreed on the rest of the points!
>As originally conceived, this is sort of like a “dangerous capability” eval for steg.
I am actually just about to start building something very similar to this for the AISI’s evals bounty program.