How do you know that humans and LLMs/current RL agents actually optimize the reward? Are there any known theorems or papers on this? The claim seems at least a little bit important.
You may answer here:
https://www.lesswrong.com/posts/GDnRrSTvFkcpShm78/when-is-reward-ever-the-optimization-target