This is a great experiment, similar to some I’ve been thinking about over the last few months; thanks for running it. I’d be curious what the results are like for stronger models (and, if they do, whether that substantially changes their answers in interesting ways). My motivations are mostly to dig up evidence of models having qualities relevant to moral patienthood, but it would also be interesting from various safety perspectives.
I’d be curious what the results are like for stronger models
Me too! Unfortunately I’m not aware of any SAEs on stronger models (except Anthropic’s SAEs on Claude, but those haven’t been shared publicly).
My motivations are mostly to dig up evidence of models having qualities relevant to moral patienthood, but it would also be interesting from various safety perspectives.
I’m interested to hear your perspective on what the results of this experiment might say about moral patienthood.