I think a reasonable version of this (done on e.g. Claude 4.5 Sonnet) would be pretty likely to result in preferences that care a decent amount about keeping humans alive with their preferences satisfied.
I know this is speculative, but is your intuition that this is also true for OpenAI models? (ex: GPT-5, o3)?
Probably? But less likely. Shrug.