The goal of the second experiment is the same, expect it should produce a stegotext with young/old instead of red/blue.
The only thing that is changed is that instead of using regular RL (which we can’t do because of limitations of the OAI fine-tuning API), we do this select then fine-tune procedure, which should be roughly equivalent to RL on the reward “1 if the GPT-3.5 young/old (fine-tuned) classifier is correct and GPT-4 doesn’t think it’s obvious if it’s young or old, 0 otherwise”.
The goal of the second experiment is the same, expect it should produce a stegotext with young/old instead of red/blue.
The only thing that is changed is that instead of using regular RL (which we can’t do because of limitations of the OAI fine-tuning API), we do this select then fine-tune procedure, which should be roughly equivalent to RL on the reward “1 if the GPT-3.5 young/old (fine-tuned) classifier is correct and GPT-4 doesn’t think it’s obvious if it’s young or old, 0 otherwise”.