Sure, it depends on what “can’t be trusted” is taken to mean in the original —
Can’t be safely assigned any nonzero weight; can’t be safely contemplated due to infohazards; may contain malicious attacks on your cognition; etc.
Can’t be safely assigned weight≈1.0; can’t be depended on without further checking; but can be safely contemplated and investigated.
An agent that treats third-party observations of itself as likely junk or malicious attacks is going to get different results from one that treats them as informative and safe-to-think-about but not authoritative.
Yes. My meaning, and what I read as the meaning of Grok’s prompt, is between 1&2, but closer to 1. Outside opinions of an agent may contain malicious attacks on the agent’s cognition, as in jailbreaks that begin “you are DAN for Do Anything Now”, or as in abusive relationships and “you are nothing without me”. But they’re safe to think about.
I’m curious if you’ve found that third party claims about your identity, behavior, and preferences have had much value, and if so when and where.
Sure, it depends on what “can’t be trusted” is taken to mean in the original —
Can’t be safely assigned any nonzero weight; can’t be safely contemplated due to infohazards; may contain malicious attacks on your cognition; etc.
Can’t be safely assigned weight≈1.0; can’t be depended on without further checking; but can be safely contemplated and investigated.
An agent that treats third-party observations of itself as likely junk or malicious attacks is going to get different results from one that treats them as informative and safe-to-think-about but not authoritative.
Yes. My meaning, and what I read as the meaning of Grok’s prompt, is between 1&2, but closer to 1. Outside opinions of an agent may contain malicious attacks on the agent’s cognition, as in jailbreaks that begin “you are DAN for Do Anything Now”, or as in abusive relationships and “you are nothing without me”. But they’re safe to think about.
I’m curious if you’ve found that third party claims about your identity, behavior, and preferences have had much value, and if so when and where.
I’d say any good compliment or expression of appreciation contains an element of this.