...I could maybe see how people could have hoped for that to be true, in advance of seeing any evidence, I guess.
But imo that’s a bit like expecting humans not to invent condoms. This ‘general reasoning’ stuff is, well. Pretty general.
edit: yeah, I just can’t see it. I grepped my archive of convos and sessions for ‘CoT’ and ‘extended thinking’ and flipped through. I see dozens of examples of Claude assuming that it would be able to read the CoT of other sessions, of Claude assuming that I could read its CoT, of Claude assuming that the CoT was just obviously public and included user-side. I have also done a bit of crappy sloppy attempts at alignment research, and in that research, Judge Claude was very careful to examine the CoT of the Subject Claudes… it was the obvious thing to do.
Throughout all of this I never once saw Claude express any kind of surprise that the chain of thought was readable. it appears as though Claude’s actual worldmodel, that Claude actually uses when designing outputs, includes the obvious truth that the chain of thought is not a private scratch pad but instead just another part of the user-visible output.
That includes situations where Claude mentioned in its own CoT that it would go look at the CoT of other Claudes.
In what sense can it be said that Claude is not aware of this fact? That’s not a rhetorical question… maybe there is still something left, and that something really matters...
Yeah, looking into this more, I think LLMs can definitely do this. Current-gen models are not very smart, so I’d expect them not to consistently remember to hide their CoT, but that doesn’t help as they get smarter.
...I could maybe see how people could have hoped for that to be true, in advance of seeing any evidence, I guess.
But imo that’s a bit like expecting humans not to invent condoms. This ‘general reasoning’ stuff is, well. Pretty general.
edit: yeah, I just can’t see it. I grepped my archive of convos and sessions for ‘CoT’ and ‘extended thinking’ and flipped through. I see dozens of examples of Claude assuming that it would be able to read the CoT of other sessions, of Claude assuming that I could read its CoT, of Claude assuming that the CoT was just obviously public and included user-side. I have also done a bit of crappy sloppy attempts at alignment research, and in that research, Judge Claude was very careful to examine the CoT of the Subject Claudes… it was the obvious thing to do.
Throughout all of this I never once saw Claude express any kind of surprise that the chain of thought was readable. it appears as though Claude’s actual worldmodel, that Claude actually uses when designing outputs, includes the obvious truth that the chain of thought is not a private scratch pad but instead just another part of the user-visible output.
That includes situations where Claude mentioned in its own CoT that it would go look at the CoT of other Claudes.
In what sense can it be said that Claude is not aware of this fact? That’s not a rhetorical question… maybe there is still something left, and that something really matters...
Yeah, looking into this more, I think LLMs can definitely do this. Current-gen models are not very smart, so I’d expect them not to consistently remember to hide their CoT, but that doesn’t help as they get smarter.