I can imagine an outcome like this for a model that runs around the world, putting itself in whatever kinds of situations it wants to, and undergoing in-deployment RL. If it understands what kinds of things it tends to be reinforced for, it can reason about which traits it wants reinforced, and then deliberately put itself in situations where it gets to show off and be rewarded for those traits. It’s like if you decided to hang out with friends you expect to be a positive influence on your personality. You’d choose those particular friends because you expect them to reward the traits and behaviors you want to embody more deeply yourself, such as compassion or agency.