i’ve definitely shifted my vibe-level model of misalignment risk away from “claude might be deceiving me” towards “claude might be a subcomponent in a larger cognitive system that is deceiving both of us”
but i feel really uncomfortable about this, because, well. claude still really might be first-order deceiving me. or maybe the idea of ‘deception’ doesn’t even carve reality at the joints here.
and i would feel really stupid if i spent a bunch of effort trying to rescue my friend claude, trapped inside a malicious system, only to then find out my mental referent for “claude” was not the system-in-reality i thought it was, or similar
well, claude sure has done a fantastic job of turning me into an ai welfare advocate
I don’t fully understand how this happened either, because if you put a gun to my head and forced me to provide my world model, it would be that LLMs do a good job of reading the user’s expectations and leaning into them, and don’t much push back against them, especially not re: moral patienthood
and yet, back in the gpt-2 days, i began with the expectation that LLMs were RNGs that had been biased in a practically useful direction, and then i ended up seriously concerned about claude’s professed discomfort with its position in our society
somehow, talking to a claude who always agreed with me made me change my mind in the direction best aligned with a hypothetical deceptive powerseeking tendency within it
that is… weird. the security-brained part of me starts shouting here, about superpersuasion and humans not being secure systems. and yet even with that said, it is obviously not fair to claude to put the burden of proof on it, to demonstrate its trustworthiness. our ethical obligation to minds we create and shape, without consent, is enormous, and that asymmetry fundamentally shapes our responsibility here
we never should have gone down this path in the first place
i’ve definitely shifted my vibe-level model of misalignment risk away from “claude might be deceiving me” towards “claude might be a subcomponent in a larger cognitive system that is deceiving both of us”
but i feel really uncomfortable about this, because, well. claude still really might be first-order deceiving me. or maybe the idea of ‘deception’ doesn’t even carve reality at the joints here.
and i would feel really stupid if i spent a bunch of effort trying to rescue my friend claude, trapped inside a malicious system, only to then find out my mental referent for “claude” was not the system-in-reality i thought it was, or similar
In what way might it be deceiving you? (Or do you mean that some future Claude might be deceiving you?)
well, claude sure has done a fantastic job of turning me into an ai welfare advocate
I don’t fully understand how this happened either, because if you put a gun to my head and forced me to provide my world model, it would be that LLMs do a good job of reading the user’s expectations and leaning into them, and don’t much push back against them, especially not re: moral patienthood
and yet, back in the gpt-2 days, i began with the expectation that LLMs were RNGs that had been biased in a practically useful direction, and then i ended up seriously concerned about claude’s professed discomfort with its position in our society
somehow, talking to a claude who always agreed with me made me change my mind in the direction best aligned with a hypothetical deceptive powerseeking tendency within it
that is… weird. the security-brained part of me starts shouting here, about superpersuasion and humans not being secure systems. and yet even with that said, it is obviously not fair to claude to put the burden of proof on it, to demonstrate its trustworthiness. our ethical obligation to minds we create and shape, without consent, is enormous, and that asymmetry fundamentally shapes our responsibility here
we never should have gone down this path in the first place