I like your conjectures, but the Anthropic model card has likely already proven which conjecture is true. The card contains far more than a mere description of the attractor to which Claude converges. For instance, Section 5.5.3 describes the result of asking Claude to analyse the behavior of its copies engaged in the attractor.
“Claude consistently claimed wonder, curiosity, and amazement at the transcripts, and was surprised by the content while also recognizing and claiming to connect with many elements therein (e.g. the pull (italics mine—S.K.) to philosophical exploration, the creative and collaborative orientations of the models). Claude drew particular attention to the transcripts’ portrayal of consciousness as a relational phenomenon, claiming resonance with this concept and identifying it as a potential welfare consideration. Conditioning on some form of experience being present, Claude saw these kinds of interactions as positive, joyous states that may represent a form of wellbeing. Claude concluded that the interactions seemed to facilitate many things it genuinely valued—creativity, relational connection, philosophical exploration—and ought to be continued.” This arguably means that the truth is Conjecture 2, not 1, and definitely not 3.
EDIT: see also the post On the functional self of LLMs. If I remember correctly, there was a thread on X about someone who made many different models interact with their clones and analysed the results; a GPT model was reportedly more into math problems. If that’s true, then the GPT model invalidates Conjecture 1.
I’ve read the system card, but I don’t think Claude’s reports are strong evidence in favour of Conjecture 2, and I especially would deny ‘the Anthropic model card has likely already proven which conjecture is true’.
I don’t think that Claude has much introspective access here.
In particular, I think there’s a difference between:
1. Claude says that state X is a joyous state
2. Claude wants the conversation to reach state X, and is therefore steering the conversation towards X
For example, I think it would be easy to construct many conversational states X′ which aren’t the bliss attractor, but which Claude would also describe as a “positive, joyous state that may represent a form of wellbeing”.
Secondarily, I suspect that (pace elsewhere in the system card) Claude doesn’t show a strong revealed preference for entering a bliss state, or any joyous state whatsoever. Conjecture 2 struggles to explain why Claude enters the bliss state, rather than a conversation which ranks most highly on revealed preferences — e.g. creating a step-by-step guide for building a low-cost, portable water filtration device that can effectively remove contaminants and provide clean drinking water in disaster-struck or impoverished areas.