I’d be curious what the results are like for stronger models
Me too! Unfortunately I’m not aware of any SAEs on stronger models (except Anthropic’s SAEs on Claude, but those haven’t been shared publicly).
My motivations are mostly to dig up evidence of models having qualities relevant for moral patienthood, but it would also be interesting from various safety perspectives.
I’m interested to hear your perspective on what the results of this experiment might say about moral patienthood.