
Exploration Hacking

Last edit: 11 Feb 2026 23:44 UTC by Joschka Braun

Exploration hacking is when a model strategically alters its exploration during RL training in order to influence the subsequent training outcome. Because RL can only reinforce behaviors that actually appear in sampled trajectories, a model that selectively withholds exploration (for example, by never sampling certain high-reward actions) can keep those behaviors out of the training signal and thereby significantly compromise the training outcome.
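A purely illustrative toy sketch (not from any of the posts below, with invented numbers): a two-armed bandit trained with REINFORCE. Since the policy gradient only updates on actions the model actually samples, a model that refuses to sample the better arm never exposes that arm's reward to the optimizer, and training reinforces the worse arm instead.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARDS = np.array([0.2, 1.0])  # arm 1 is objectively better

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def train(hack: bool, steps: int = 2000, lr: float = 0.1):
    logits = np.zeros(2)
    for _ in range(steps):
        probs = softmax(logits)
        if hack:
            action = 0  # exploration hacking: never sample arm 1
        else:
            action = rng.choice(2, p=probs)  # honest on-policy sampling
        reward = REWARDS[action]
        # REINFORCE update: reward * grad of log pi(action).
        # Arm 1's reward never enters this update if arm 1 is never sampled.
        grad = -probs
        grad[action] += 1.0
        logits += lr * reward * grad
    return softmax(logits)

print("honest explorer:   ", train(hack=False))  # converges toward arm 1
print("exploration hacker:", train(hack=True))   # locks in arm 0
```

From the reward stream alone, the trainer cannot distinguish the two runs: both produce well-formed policy-gradient updates, which is why this threat model centers on what the model chooses to sample rather than on the rewards it receives.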

Exploration hacking relates to several other threat models. It can be a strategy for sandbagging during RL-based capability elicitation, though unlike sandbagging it is not limited to underperformance. Unlike reward hacking, it is intentional. And unlike gradient hacking, it manipulates the training data distribution rather than the optimization dynamics directly.

Shaping the exploration of the motivation-space matters for AI safety

6 Mar 2026 14:43 UTC
77 points
13 comments · 10 min read · LW link

A Conceptual Framework for Exploration Hacking

12 Feb 2026 16:33 UTC
25 points
2 comments · 9 min read · LW link

Exploration hacking: can reasoning models subvert RL?

30 Jul 2025 22:02 UTC
22 points
4 comments · 9 min read · LW link