We haven’t experimented with models other than Qwen3-4B, but we will in follow-up work.
On the point regarding unconditionally down-weighting the penalized word, this was our primary motivation for trying the judge penalty. We agree that it is a strong reason to be skeptical of the regex penalty results in isolation. Our future work will only focus on more complex penalties, so we hope to gain more evidence on this soon.
We haven’t experimented with models other than Qwen3-4B, but we will in follow-up work.
On the point regarding unconditionally down-weighting the penalized word, this was our primary motivation for trying the judge penalty. We agree that it is a strong reason to be skeptical of the regex penalty results in isolation. Our future work will only focus on more complex penalties, so we hope to gain more evidence on this soon.