Backpropagation of the error gradient is more similar to nociception/torture than evolution by random mutation.
I need to check how RLHF is actually implemented...
EDIT: error backpropagation is the workhorse behind both reward-model training and the policy update.
The NN is punished for not doing as well as it could have.
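A minimal sketch of what that "punishment" looks like mechanically, assuming a REINFORCE-style policy-gradient update with a toy reward model (all names and shapes here are illustrative placeholders, not any particular RLHF library's API):

```python
import torch
import torch.nn as nn

policy = nn.Linear(16, 4)          # stand-in for the language model's output head
reward_model = nn.Linear(16, 1)    # stand-in for a learned reward model
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

state = torch.randn(8, 16)                       # batch of "prompts"
logits = policy(state)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                           # sampled "responses"

with torch.no_grad():
    reward = reward_model(state).squeeze(-1)     # scalar score per sample
    advantage = reward - reward.mean()           # how much better/worse than expected

# REINFORCE-style loss: actions with negative advantage get their log-probability
# pushed down, i.e. the network is "punished" for doing worse than it could have.
loss = -(advantage * dist.log_prob(action)).mean()

loss.backward()        # the error gradient flows back through the policy
optimizer.step()
```

The point of the sketch is just that the corrective signal arrives via backpropagated gradients scaled by how badly the sample scored, not via random variation and selection.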