"reward chisels cognitive grooves into an agent"
This makes sense, but if the agent is smart enough to know how it *could* wirehead, perhaps wireheading would eventually result from the chiseling of some highly abstract grooves.
To give an example, suppose you go to Domino’s Pizza on Saturday at 6pm and eat some Hawaiian pizza. You enjoy the pizza. This reinforces the behaviour of “Go to Domino’s Pizza on Saturday at 6pm and eat some Hawaiian pizza”.
Surely this will also reinforce other, more generic behaviours that include this behaviour as a special case, such as:
“Go to a pizza place in the evening and eat pizza.”
“Go to a restaurant and eat yummy food.”
Well then, why not “do a thing that I know will make me feel good”? That includes the original behaviour as a special case. It also includes wireheading.
(This is a different explanation of a similar point made in this comment from hillz: https://www.lesswrong.com/posts/pdaGN6pQyQarFHXF4/reward-is-not-the-optimization-target?commentId=oZ6aX3bzNF5bwvL4S, but it seemed different enough to be worth a separate comment.)
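To make the mechanism concrete, here’s a toy Python sketch (my own illustration, not anything from the original post; the feature names and numbers are invented). If the agent’s learned value function generalises over shared features, then rewarding the specific pizza trip also nudges up the estimated value of everything sharing the abstract “feels good” feature, wireheading included, even though the agent has never wireheaded.

```python
# Toy sketch: credit for one rewarded behaviour bleeds into more abstract
# behaviour classes under linear value-function generalisation.
import numpy as np

# Abstract-to-specific features a behaviour can activate (all names invented).
FEATURES = [
    "feels_good",            # "do a thing that I know will make me feel good"
    "eat_yummy_food",        # "go to a restaurant and eat yummy food"
    "pizza_place_evening",   # "go to a pizza place in the evening and eat pizza"
    "dominos_sat_6pm",       # the fully specific behaviour
]

# Binary feature vectors: which features each behaviour activates.
BEHAVIOURS = {
    "dominos_hawaiian_sat_6pm": np.array([1, 1, 1, 1], dtype=float),
    "other_restaurant_dinner":  np.array([1, 1, 0, 0], dtype=float),
    "wirehead":                 np.array([1, 0, 0, 0], dtype=float),
}

weights = np.zeros(len(FEATURES))  # learned value weights, start at zero

def value(behaviour: str) -> float:
    """Estimated value of a behaviour under the current weights."""
    return float(BEHAVIOURS[behaviour] @ weights)

def td_update(behaviour: str, reward: float, lr: float = 0.1) -> None:
    """One TD(0)-style update toward the observed reward."""
    global weights
    phi = BEHAVIOURS[behaviour]
    error = reward - value(behaviour)
    weights += lr * error * phi  # credit spreads to every active feature

# Repeatedly enjoy the pizza...
for _ in range(50):
    td_update("dominos_hawaiian_sat_6pm", reward=1.0)

# ...and the estimated value of wireheading rises too, because it shares the
# abstract "feels_good" feature with the rewarded behaviour.
for b in BEHAVIOURS:
    print(f"{b}: {value(b):.2f}")
```

The linear features are just a stand-in for whatever representation the agent actually learns; the point is only that credit assigned to a specific behaviour leaks into the more abstract behaviour classes that contain it.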
Could you give some examples?