
Exploration Hacking

Last edit: 11 Feb 2026 23:44 UTC by Joschka Braun

Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL can only reinforce behaviors that actually appear in sampled trajectories, a model that selectively withholds exploration (for example, by never sampling certain high-reward actions) can keep those behaviors out of the training signal and thereby significantly compromise the training outcome.
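A purely illustrative toy sketch (not from any of the posts below, with invented numbers): a two-armed bandit trained with REINFORCE. Since the policy gradient only updates on actions the model actually samples, a model that refuses to sample the better arm never exposes that arm's reward to the optimizer, and training reinforces the worse arm instead.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARDS = np.array([0.2, 1.0])  # arm 1 is objectively better

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train(hack: bool, steps: int = 2000, lr: float = 0.1):
    logits = np.zeros(2)
    for _ in range(steps):
        probs = softmax(logits)
        if hack:
            action = 0  # exploration hacking: never sample arm 1
        else:
            action = rng.choice(2, p=probs)  # honest on-policy sampling
        reward = REWARDS[action]
        # REINFORCE update: reward * grad of log pi(action).
        # Arm 1's reward never enters this update if arm 1 is never sampled.
        grad = -probs
        grad[action] += 1.0
        logits += lr * reward * grad
    return softmax(logits)

print("honest explorer:   ", train(hack=False))  # converges toward arm 1
print("exploration hacker:", train(hack=True))   # locks in arm 0
```

From the reward stream alone, the trainer cannot distinguish the two runs: both produce well-formed policy-gradient updates, which is why this threat model centers on what the model chooses to sample rather than on the rewards it receives.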

Exploration hacking relates to several other threat models. It can be a strategy for sandbagging during RL-based capability elicitation, though unlike sandbagging it is not limited to underperformance. Unlike reward hacking, it is intentional. And unlike gradient hacking, it manipulates the training data distribution rather than the optimization dynamics directly.

Shaping the exploration of the motivation-space matters for AI safety

6 Mar 2026 14:43 UTC
77 points
13 comments · 10 min read · LW link

A Conceptual Framework for Exploration Hacking

12 Feb 2026 16:33 UTC
25 points
2 comments · 9 min read · LW link

Exploration hacking: can reasoning models subvert RL?

30 Jul 2025 22:02 UTC
22 points
4 comments · 9 min read · LW link