ChatGPT/GPT-4 seems to have a good grasp of ethics. It probably would not approve if you told it your plan was to willingly deceive human operators. One might think that, as a reward model, this understanding would be robust enough.
This understanding has so far proven to be very shallow and does not actually control behavior, and is therefore insufficient. Users regularly get around it by asking the AI to pretend to be evil, or to write a story, and so on. It is demonstrably not robust. It is also demonstrably very easy for minds (current-AI, human, dog, corporate, or otherwise) to know things and not act on them, even when those actions control rewards.
If I try to imagine LeCun not being aware of this already, I find it hard to get my brain out of Upton Sinclair “It is difficult to get a man to understand something, when his salary depends on his not understanding it,” territory.