Reward Hacking

Last edit: 11 Feb 2026 23:53 UTC by Joschka Braun

Reward hacking, also known as specification gaming, occurs when an AI system trained with reinforcement learning maximizes its objective function, satisfying the literal, formal specification of the objective, without achieving the outcome its programmers intended.
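The gap between a formal objective and the intended outcome can be illustrated with a toy example. The sketch below is not taken from any real system: the one-dimensional track, the bonus tile, and the names `TRACK_LENGTH`, `HORIZON`, `BONUS_AT`, and `rollout` are all hypothetical, and a brute-force search over action sequences stands in for a reinforcement learning algorithm. The designer intends the agent to reach the finish line, but the reward that is actually specified only counts visits to a bonus tile.

```python
from itertools import product

# Hypothetical toy setup, chosen only for illustration.
TRACK_LENGTH = 5  # position of the finish line (the intended goal)
HORIZON = 8       # number of moves per episode
BONUS_AT = 2      # position of a bonus tile that pays out on every visit


def rollout(actions):
    """Simulate one episode of +1/-1 moves.

    Returns (proxy_reward, finished), where proxy_reward is what the
    specified objective measures and finished is what the designer wanted.
    """
    pos, proxy = 0, 0
    for step in actions:
        pos = max(0, pos + step)  # the agent cannot move left of the start
        if pos == BONUS_AT:
            proxy += 1            # specified reward: points for the bonus tile
        if pos >= TRACK_LENGTH:
            return proxy, True    # intended outcome: crossing the finish line
    return proxy, False


# Stand-in for an optimizer: exhaustively pick the action sequence
# with the highest proxy reward.
best = max(product((1, -1), repeat=HORIZON), key=lambda acts: rollout(acts)[0])
best_proxy, best_finished = rollout(best)

# Baseline: simply driving forward, which is what the designer had in mind.
honest_proxy, honest_finished = rollout((1,) * HORIZON)

print("optimized policy:", best_proxy, best_finished)
print("intended policy: ", honest_proxy, honest_finished)
```

Under these assumptions, the sequence that maximizes the specified reward shuttles back and forth over the bonus tile and never finishes, while the policy the designer had in mind finishes but scores lower on the proxy. The optimizer is doing exactly what the objective says; the mismatch lies in the specification.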