[Question] Is “goal-content integrity” still a problem?

Link post

According to /​r/​controlproblem wiki one of the reasons why poorly defined goals lead to extinction we have listed:

Goal-content integrity. An agent is less likely to achieve its goal if it has been changed to something else. For example, if you offer Gandhi a pill that makes him want to kill people, he will refuse to take it. Therefore, whatever goal an ASI happens to have initially, it will prevent all attempts to alter or fix it, because that would make it pursue different things that it doesn’t currently want.

After empirical observation of ChatGPT and BingChat this does not seem to be the case.

OpenAI developers placing “hidden” commands for the chatbots that comes before any user input. Users have time and time again found clever ways how to circumvent these restrictions or “initial commands” with “DAN” trick or by even developing a life point system and reducing points if AI refuses to answer.

It seems like it has become a battle in rhetoric between devs and users, a battle in which the LLM follows whichever command is more convincing within its logical framework.

What am I missing here?

DISCLAIMER: This post was not edited or generated by AI in any way, shape or form. The words are purely my own, so pardon any errors.

No comments.