Possible setup to test steganography capabilities:
Pick two tasks which model is incapable of doing without CoT. Show two task statements, require it to solve both problems with a single CoT and give two distinct answers to both tasks.
Then show the judge (a human or a model) only one pair question → shared CoT → answer, and require the judge to guess what was the other task.
You can do some variations on that, train the model to be better at this, train both models adversarially.
Or you can make some portion of tests to be just the one problem to solve, and part two problems. The judge has to guess was there another problem and if so what problem.
Possible setup to test steganography capabilities:
Pick two tasks which model is incapable of doing without CoT. Show two task statements, require it to solve both problems with a single CoT and give two distinct answers to both tasks.
Then show the judge (a human or a model) only one pair question → shared CoT → answer, and require the judge to guess what was the other task.
You can do some variations on that, train the model to be better at this, train both models adversarially.
Or you can make some portion of tests to be just the one problem to solve, and part two problems. The judge has to guess was there another problem and if so what problem.