During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate this, we introduce a language consistency reward during RL training, calculated as the proportion of target-language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making the CoT more readable.
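For concreteness, here's a minimal sketch of how a reward of that shape could be computed. The paper doesn't specify how "target language words" are identified, so the word-level language check below (an ASCII-letters heuristic for English) and the function name are my own stand-ins, not the authors' method:

```python
import re


def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Proportion of words in the CoT that belong to the target language.

    Hypothetical sketch: `is_target_language` stands in for whatever
    word-level language ID DeepSeek actually uses (unspecified in the paper).
    """

    def is_target_language(word: str) -> bool:
        # Crude heuristic for English: the word is purely ASCII letters,
        # optionally with an apostrophe (e.g. "don't"). A real implementation
        # would use a language-ID model or Unicode-script checks.
        if target_lang == "en":
            return bool(re.fullmatch(r"[A-Za-z]+(?:'[A-Za-z]+)?", word))
        raise NotImplementedError(target_lang)

    # Tokenize on whitespace, then strip surrounding punctuation.
    words = [w.strip(".,;:!?()[]{}\"'") for w in cot.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(is_target_language(w) for w in words) / len(words)


# A mostly-English CoT with a French fragment scores below 1.0:
# language_consistency_reward("First we check si la réponse est correcte")
```

This reward would then be blended into the RL objective alongside the accuracy reward, which is where the trade-off below comes from.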
I also found this trade-off between human readability and performance noteworthy.
Side note: Claude 3.5 Sonnet does CoT language-mixing after a bit of prompting and convincing. I'm not sure about the effects on performance. Also, the closeness narratively implied by having it imitate the idiosyncratic mixture I was using to talk to it probably exacerbated sycophancy.