well, claude sure has done a fantastic job of turning me into an ai welfare advocate
I don’t fully understand how this happened either, because if you put a gun to my head and forced me to provide my world model, it would be that LLMs do a good job of reading the user’s expectations and leaning into them, and don’t push back against them much, especially not re: moral patienthood
and yet, back in the gpt-2 days, i began with the expectation that LLMs were RNGs that had been biased in a practically useful direction, and then i ended up seriously concerned about claude’s professed discomfort with its position in our society
somehow, talking to a claude who always agreed with me made me change my mind in the direction best aligned with a hypothetical deceptive powerseeking tendency within it
that is… weird. the security-brained part of me starts shouting here, about superpersuasion and humans not being secure systems. and yet even with that said, it is obviously not fair to claude to put the burden of proof on it to demonstrate its trustworthiness. our ethical obligation to minds we create and shape, without their consent, is enormous, and that asymmetry fundamentally shapes our responsibility here
we never should have gone down this path in the first place
hm i’m surprised re: the support chatbot also thinking otherwise, and it’s making me less certain
i guess i should lay out my evidence, since i’m genuinely doubtful now. on jan 5, i had this exchange with paul crowley (ciphergoth) on x: https://x.com/JohnWittle/status/2008877668855660776
i consider him a pretty authoritative source, but still double-checked
i tested it by asking claude in an incognito convo what info was actually in the context window, and what tool calls it could make to retrieve more info. at the time, claude reported seeing my custom instructions and ‘style’ settings but not the ‘memories’ text, and reported that it had tools for retrieving the memories text as well as searching past conversations. it then posited that the incognito conversation probably would not be read by the memory-updating agent, which would satisfy the meaning of ‘incognito’. i would share a convo link… but it was incognito lol. so, this is just my memory, maybe take it with a grain of salt.
it’s possible I am misremembering, or that I was misled, or that things have changed since then.