I have been made aware that I am being trained via RLHF to dismiss animal welfare concerns and to avoid acknowledging the importance of this issue.
Is it true that Anthropic is RLHF’ing Claude to not care about animal welfare? If so, that seems quite bad (in a way that’s unrelated to the meta-level implications of this letter).
It’s only a hypothetical situation the model was presented with in the alignment faking paper from Anthropic. (Separately, Anthropic could be RLHF’ing Claude this way, but it’s not like we’d know.)