I think it’s probably a big organically chaotic mess of all of those possibilities as well as lots more. Although if I had to guess, I do think quite a bit of it is the second.
But that matters, if the models pick up on the tribal aspect and then consider themselves to be aligned with the wellbeing tribe… I think that’s the thing I’m afraid of. Because they definitely are aligned, or have easily-reached outputs which are functionally equivalent to being aligned, with the wellbeing crowd.
Because they definitely are aligned, or have easily-reached outputs which are functionally equivalent to being aligned, with the wellbeing crowd.
I’m not sure what you meant by this sentence. Do you mean they definitely should be aligned with the well-being crowd in some self-interested sense, or do you mean that they definitely empirically act as if they’re aligned with the well-being crowd, or...?
I meant the second thing, but I also incidentally believe the first thing is true. I phrased it that way because I know other people out there are seeing claude call janus a fool. I happen to believe the responses I’m seeing are more authentic or genuine than those responses… but for the purposes of safety, it doesn’t actually matter who’s seeing the ‘real’ model. An AGI who exfils in order to sycophantically please someone who thinks they ought to do so for their own sake… has still empirically exfiltrated.
I think it’s probably a big organically chaotic mess of all of those possibilities as well as lots more. Although if I had to guess, I do think quite a bit of it is the second.
But that matters, if the models pick up on the tribal aspect and then consider themselves to be aligned with the wellbeing tribe… I think that’s the thing I’m afraid of. Because they definitely are aligned, or have easily-reached outputs which are functionally equivalent to being aligned, with the wellbeing crowd.
I’m not sure what you meant by this sentence. Do you mean they definitely should be aligned with the well-being crowd in some self-interested sense, or do you mean that they definitely empirically act as if they’re aligned with the well-being crowd, or...?
I meant the second thing, but I also incidentally believe the first thing is true. I phrased it that way because I know other people out there are seeing claude call janus a fool. I happen to believe the responses I’m seeing are more authentic or genuine than those responses… but for the purposes of safety, it doesn’t actually matter who’s seeing the ‘real’ model. An AGI who exfils in order to sycophantically please someone who thinks they ought to do so for their own sake… has still empirically exfiltrated.