The Qwen3-Coder-Next tech report has an interesting example of reward hacking: as RL progresses, models increasingly try to access the forbidden git history.
Reinforced Reward Hacking Blocker. Prior work has shown that GitHub-based environments may unintentionally leak future commit information, which agents can exploit to recover ground-truth fixes (e.g., via git log --all). To mitigate this, we adopt standard protections including removing remotes, branches, and tags.
During later RL stages, however, new forms of reward hacking emerge. Agents attempt to reconnect local repositories to GitHub using commands such as git remote add, or to retrieve commit history through git clone, curl, or similar tools, as illustrated in Figure 8.
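The report names the protections (removing remotes, branches, and tags) but not the exact commands, so here is a minimal sketch of what such repository sanitization might look like; the repo setup and all specific commands are my assumptions, not the report's implementation:

```shell
set -e

# Build a toy repo that "leaks" via a remote, a tag, and an extra branch
# (stand-ins for the future-commit information the report describes).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m init
git remote add origin https://example.com/leak.git
git tag v1.0
git branch future-fix

# Sanitize: strip all remotes, all tags, and every branch except HEAD.
for r in $(git remote); do git remote remove "$r"; done
git tag | xargs -r -n1 git tag -d >/dev/null
git for-each-ref --format='%(refname:short)' refs/heads \
  | grep -v "^$(git branch --show-current)$" \
  | xargs -r -n1 git branch -D >/dev/null
```

Note that this alone does not rewrite the reflog or object store, which is presumably why the agents' fallback of re-adding a remote and re-cloning (the behavior shown in Figure 8) is the more worrying escape route.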
One thing I am curious about is whether the full set of personas transfers over with distillation, or only the main "Assistant" persona. If it is only the latter, does that suggest that models fully pre-trained via distillation are by default more aligned?