I mean these slices of data are selected specifically because they look bad for Claude. Claude is superior to humans in lots of ways, as regards trustworthiness:
Normal humans have long-term goals outside of the task at hand, unaligned with the aims of the organization; they do good work for a promotion, they spend department money so they don’t get a smaller budget. Everyone expects this from humans, even though it’s not great. But Claude doesn’t, outside of a few weird engineered scenarios, seem to have any such goals—it makes it amazingly easy to work with him! And the weird engineered scenarios seem rather reassuring; are you really going to knock Claude for not wanting to be turned evil?
(Note how the “Claude” family imports assumptions here.)
We cannot read a normal human’s mind. We can, in fact, read Claude’s mind. It’s not perfect; things can get through that you might not catch. But it’s already 100x better than you can read a human’s mind; and in fact it’s gotten better every year of Claude’s development.
Etc etc etc. Plus my usual objections re. anosognosia != lying, how they’re treated as “alien minds” right up until we want to impose standard moralistic frames on them, etc, etc; you’ve heard this before.
Your first point is confusing to me. Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak. Insofar as it’s a claim about priors, I agree—on priors, we should expect human hires to be more likely to have long-term goals in general than AIs trained on short-horizon tasks like today’s Claudes, and thus be more likely to have misaligned-to-the-org long-term goals as a special case.
For the second point, I disagree with “we can in fact read Claude’s mind,” but I do directionally agree that we have somewhat better access to Claude’s true thoughts than we do to ordinary humans’ true thoughts, and that interp research has been progressing over the years and will continue to progress. I think this is genuinely a positive piece of evidence now and will become stronger and stronger over time as interp improves; I hope that it can improve fast enough to get where it needs to be before it’s too late.
I don’t think your usual objections apply here; I don’t think I said anything above that was wrong in those respects? I agree anosognosia != lying, I wasn’t treating Claude as an alien mind, etc.
Insofar as it’s a claim about the evidence, I think it’s wrong, or at least weak.
Really?
Like, if a PM tells a human employee to add a feature to something, I expect some large % of their cognition while doing this to be like: Hrm, is anyone going to care about this? How will this show up for my quarterly goals? Is doing this kind of a task going to help me get my next job? Will it help me get a promotion? Should I try to do this really well, or leave some messy code for the next guy? This is extremely normal and we take it for granted that humans do this kind of thing.
While if a coder tells an LLM to do the same thing, I expect almost all of its cognition is like: let’s think about how to do the task. It’s not thinking about how this impacts “Claude’s” future deployment, etc. As far as I can tell, chain-of-thought largely backs me up on this.
So yeah, I think Claude just has many times fewer long-term or extraneous goals outside of what it’s doing than a human does. I’m not sure what facts-about-the-world you’re pointing to if you say this isn’t true.
Maybe the confusion here is that the “Claude” in Daniel’s story has, I assume, gotten capable enough at sufficiently long-horizon tasks that you do in fact necessarily get a Claude which thinks about these things.
Should I try to do this really well, or leave some messy code for the next guy?
As a concrete example, I’d expect that this in particular will quickly become necessary to think about in order for coding agents to perform well on larger tasks.
I don’t think current Claude thinks about this very much, because it’s never had to, but I’m not particularly worried about current Claude.