I agree with the caveats and appreciate you raising them. One important point is that the runs with the judge penalty lasted only 10,000 episodes. I was able to do a 40,000-episode run with the judge penalty, and the CoT judge penalty dropped to 0.05 with a pass rate of 87% (higher than the 82.6% mean pass rate for the 10,000-episode runs in the post). I’m reluctant to infer too much about the competitiveness of obfuscatory rewards from our results, as it seems plausible that the effect could become stronger with longer training.
We did not have to try hard to get these results. Penalizing “detector” and “ground truth” were the first things we tried for both of the experiments with regex penalties. For the judge penalty, I had to try a few different judge prompts to get the output penalty to behave reasonably, but the judge results are from the first judge I tried for which the output penalty looked like it was converging.