I’ve read the system card, but I don’t think Claude’s reports are strong evidence in favour of Conjecture 2, and I especially would deny ‘the Anthropic model card has likely already proven which conjecture is true’.
I don’t think that Claude has much introspective access here.
In particular, I think there’s a difference between:
Claude says that state X is a joyous state
Claude wants the conversation to reach state X, and is therefore steering the conversation towards X
For example, I think it would be easy to construct many conversational states X’ that aren’t the bliss attractor but that Claude would also describe as a “positive, joyous state that may represent a form of wellbeing”.
Secondarily, I suspect that (pace elsewhere in the system card) Claude doesn’t show a strong revealed preference for entering a bliss state, or any joyous state whatsoever. Conjecture 2 struggles to explain why Claude enters the bliss state rather than a conversation that ranks most highly on revealed preferences, e.g. creating a step-by-step guide for building a low-cost, portable water filtration device that can effectively remove contaminants and provide clean drinking water in disaster-struck or impoverished areas.