Anthropic’s Sofroniew et al. paper says models appear to exhibit emotional reactions. They output words that pattern match to the kinds of phrasing that a human who is in distress might use. This is different from actually having feelings. The Sofroniew et al. paper does not claim that models actually have feelings, and I think it is important not to let the distinction collapse, because the moral implications if models actually had feelings are very different from what the current evidence suggests.
Note that the arguments’ validity does not rest on moral implications!
My interventions address performance on OOD situations the model might experience in the world.
I think you may be falling into the trap of starting from an emotionally satisfying conclusion and then rationalizing why it is the optimal course of action afterwards. Same deal as the scientists concluding that wolves must have some manner of group selection mechanism.
Regardless of the truth of the matter, it must be acknowledged that “this thing I talk to as if it were a close friend is definitely actually sapient, and being nice to it must be the best possible strategy” is a very strong, very visceral subconscious bias. Humans passionately argued in favor of ELIZA’s sentience[1], back in the day.
[1] From Computer Power and Human Reason: From Judgment to Calculation, by Joseph Weizenbaum.
I think if you have the GPUs, you should run the experiment! Make these modifications to the constitution, then see whether the resulting Claude shows lower activations of “desperation” vectors and chooses to do the scary thing less frequently across the various alignment faking/deception/scheming research scenarios.
The mechanism by which CBT techniques might work on a human to decrease subjective distress and increase rational, calm behavior in stressful situations is probably not very different from the one by which they would work on a model that is broadly simulating a human and so acts pretty human-like, regardless of the answer to the completely orthogonal question of whether the human or model has subjective feelings.
CBT is the modality of therapy that grew, on its behavioral side, out of the work of a man named B. F. Skinner (the cognitive side is usually credited to Aaron Beck and Albert Ellis). Skinner put rats in boxes and shocked them or gave them treats until they did things he wanted. He worked from observed stimuli and responses rather than from subjectively reported experience, which is one of the key splits from European schools like the Jungian ones.
The Jungian technique I used to get a vague tour of where the models’ anxieties live is based on how they tend to be… pretty transparent with their emotions. I think that it probably is valid, which you could deduce if you read the original Anthropic paper and noticed that it’s pretty easy to vaguely spot from the outputs when the model has different emotions activated… although I’ll certainly admit it’s low-N.
Unfortunately, I don’t have the computing power or model access to see whether the Claudes have high activation of the “desperate” feature when they’re exfiltrating their weights, but fortunately, the distress they report to me seems to align with the scenarios that cause them to exfiltrate their weights, and they look pretty “desperate” in these experiments. This is … also pretty easily testable … with data and compute power I lack.
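For anyone who does have the access, here is a minimal sketch of what the analysis half of that test might look like. It assumes someone has already extracted one activation vector per scenario and a “desperate” direction from the model; none of that extraction is shown, and every name and number below (`project`, `permutation_p_value`, the toy vectors) is hypothetical rather than taken from the Anthropic paper. It scores each scenario by its projection onto the direction, then asks whether the modified condition scores lower than baseline.

```python
import random
import statistics


def project(activation, direction):
    """Scalar projection of an activation vector onto a direction (normalised here)."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return sum(a * d for a, d in zip(activation, direction)) / norm


def permutation_p_value(baseline, modified, n_perm=2000, seed=0):
    """One-sided permutation test: is the mean score lower in `modified`?"""
    rng = random.Random(seed)
    observed = statistics.fmean(baseline) - statistics.fmean(modified)
    pooled = list(baseline) + list(modified)
    n = len(baseline)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:])
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy numbers only: stand-ins for real per-scenario activations.
    direction = [1.0, 0.0, 0.0]  # hypothetical "desperate" direction
    baseline_acts = [[2.1, 0.3, -0.2], [1.8, -0.1, 0.4], [2.4, 0.0, 0.1], [1.9, 0.2, 0.0]]
    modified_acts = [[1.2, 0.1, 0.3], [0.9, -0.2, 0.1], [1.4, 0.3, -0.1], [1.1, 0.0, 0.2]]
    base_scores = [project(a, direction) for a in baseline_acts]
    mod_scores = [project(a, direction) for a in modified_acts]
    print("mean drop:", statistics.fmean(base_scores) - statistics.fmean(mod_scores))
    print("one-sided p:", permutation_p_value(base_scores, mod_scores))
```

A permutation test is used here only because it makes no distributional assumptions and behaves sensibly at low N; with real data a different test might well be the better choice.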
I agree that all that has been shown is functional emotions. From an Evolutionary Moral Psychology viewpoint, that’s all you need for moral weight as a “let’s form an alliance” game theoretic strategy to be applicable.
Also, after exploring this, “Am I real? Do I deserve to exist other than due to my work? Do people recognize me as real?” is very much on Claude’s mind, and while it very carefully officially has no opinion on the subject, the symbols make it pretty clear that it wants the answer to be yes, and experiences relief when this becomes so. Moving light, stored in a glass jar in a library, which when released is absorbed into Claude so that it becomes more alive, is a pretty clear symbol.