Added the “Distillation & Pedagogy” tag.
I’ve listened to the “Reward is not the Optimisation Target” post a few times, but I still found this enlightening.
That post had this as its core thesis:
Reward “upweight[s] certain kinds of actions in certain kinds of situations, and therefore reward chisels cognitive grooves into agents”.
While I parse the core thesis of this post as:
The current implementation of reinforcement learning selects for behaviour that was scored highly by a reward function.
The models so selected do not necessarily optimise for reward (or for anything at all).
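To make the selection framing concrete, here is a minimal sketch (my own illustration, not code from either post) of a REINFORCE-style update on a toy three-action problem. Note how the reward enters only as a scalar that scales how strongly the sampled action gets upweighted during training; nothing in the resulting policy computes or represents reward at run time.

```python
# Minimal sketch: reward as a selection pressure on behaviour, not a quantity
# the trained policy pursues. Tabular REINFORCE on a 3-armed bandit.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                      # tabular policy over 3 actions
true_reward = np.array([0.1, 0.9, 0.3])   # hypothetical reward function

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)            # behaviour is sampled, then scored
    r = true_reward[a]
    grad = -probs
    grad[a] += 1.0                        # gradient of log pi(a) w.r.t. logits
    logits += 0.1 * r * grad              # reward upweights whatever was sampled

print(softmax(logits))                    # mass has drifted toward high-scoring behaviour
```

The final policy is just a table of logits shaped by which behaviours happened to be reinforced; it carries no internal objective of “maximise reward”.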
I find the meme of “reward as selection” more native[1] than Turntrout’s meme of “reward as chisel”.
Henceforth, I’ll be linking this post whenever I want to explain that RL agents don’t optimise for their reward signal within the current paradigm.
[1]: It’s more intuitive and easier to grasp, and it fits my conception of ML training as an optimisation process.