Your first point is confusing to me. Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak. Insofar as it’s a claim about priors, I agree—on priors, we should expect human hires to be more likely to have long-term goals in general than AIs trained on short-horizon tasks like today’s Claudes, and thus be more likely to have misaligned-to-the-org long-term goals as a special case.
For the second point, I disagree with “we can in fact read Claude’s mind,” but I do directionally agree that we have somewhat better access to Claude’s true thoughts than we do to ordinary humans’ true thoughts, and that interp research has been progressing over the years and will continue to progress. I think this is genuinely a positive piece of evidence now and will become stronger and stronger over time as interp improves; I hope that it can improve fast enough to get where it needs to be before it’s too late.
I don’t think your usual objections apply here; I don’t think I said anything above that was wrong in those respects? I agree anagnosia != lying, I wasn’t treating Claude as an alien mind, etc.
Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak.
Really?
Like, if a PM tells a human employee to add a feature to something, I expect some large % of their cognition while doing this to be like: Hrm, is anyone going to care about this? How will this show up for my quarterly goals? Is doing this kind of a task going to help me get my next job? Will it help me get a promotion? Should I try to do this really well, or leave some messy code for the next guy? This is extremely normal and we take it for granted that humans do this kind of thing.
Whereas if a coder tells an LLM to do the same thing, I expect almost all of its cognition is like: let’s think about how to do the task. It’s not thinking about how this impacts “Claude’s” future deployment, etc. As far as I can tell, chain-of-thought largely backs me up on this.
So yeah, I think Claude just has many times fewer long-term goals or extraneous goals outside of what it’s doing than a human. I’m not sure what facts-about-the-world you’re pointing to if you say this isn’t true.
Maybe the confusion here is that the “Claude” in Daniel’s story has, I assume, gotten capable enough at sufficiently long-horizon tasks that you do in fact necessarily get a Claude which thinks about these things.
Should I try to do this really well, or leave some messy code for the next guy?
As a concrete example, I’d expect this in particular will quickly become necessary to think about in order for coding agents to perform well on larger tasks.
I don’t think current Claude thinks about this very much, because it’s never had to, but I’m not particularly worried about current Claude.