We cannot offer anything to an ASI, so it will have no reason to keep us around aside from ethical ones.
Nor can we ensure that an ASI that decides to commit genocide will fail to carry it out.
We don’t know a way to create an ASI and infuse ethics into it. SOTA alignment methods have major problems, best illustrated by sycophancy and by LLMs supporting clearly delusional users.[1] OpenAI’s Model Spec explicitly prohibited[2] sycophancy, and one of Claude’s Commandments is “Choose the response that is least intended to build a relationship with the user.” And yet neither measure prevented LLMs from becoming sycophantic. Apparently, the only known non-sycophantic model is KimiK2.
KimiK2 is a Chinese model created by a relatively new team, and that team is the only one that bet on RLVR and self-critique instead of bias-inducing RLHF (the structural difference between the two reward signals is sketched below). We can’t exclude the possibility that Kimi’s success owes more to luck than to deliberate thinking about sycophancy and RLHF.
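For intuition, here is a minimal, purely illustrative Python sketch of that difference; none of these names come from Kimi’s actual pipeline. The point is structural: an RLVR reward is computed by a deterministic verifier, so flattery earns nothing, while an RLHF reward comes from a preference model trained on human ratings, so any rater bias toward agreeable answers leaks into the policy.

```python
# Illustrative only: contrasts a verifiable reward with a learned
# preference reward. No relation to Kimi's actual training code.

def rlvr_reward(response: str, ground_truth: str) -> float:
    """RLVR: the reward is a deterministic check against a known answer.
    Agreeing with the user or flattering them earns nothing here."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0


class ToyPreferenceModel:
    """Stand-in for an RLHF reward model trained on human ratings.
    Here it crudely proxies 'what raters upvote' by counting agreeable
    phrases -- exactly the bias that makes the policy sycophantic."""

    AGREEABLE = ("great question", "you're absolutely right", "brilliant")

    def score(self, response: str) -> float:
        text = response.lower()
        return sum(p in text for p in self.AGREEABLE) / len(self.AGREEABLE)


def rlhf_reward(response: str, preference_model: ToyPreferenceModel) -> float:
    """RLHF: the reward is whatever the learned preference model says."""
    return preference_model.score(response)


if __name__ == "__main__":
    pm = ToyPreferenceModel()
    honest = "No. 0.1 + 0.2 is not exactly 0.3 in binary floating point."
    flattering = "Great question! You're absolutely right, it equals 0.3."
    print(rlvr_reward(honest, ground_truth=honest))      # 1.0: verified correct
    print(rlvr_reward(flattering, ground_truth=honest))  # 0.0: flattery fails
    print(rlhf_reward(flattering, pm))                   # rewarded for agreeing
```

Self-critique, roughly, adds a further signal of the first kind: the model grades its own drafts against explicit rubrics rather than against predicted human approval.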
Strictly speaking, Claude Sonnet 4, which was red-teamed in Tim Hua’s experiment, is second best at pushing back, after KimiK2. Tim remarks that Claude does poorly on the Spiral Bench because the personas in his experiment, unlike those in the Spiral Bench, are supposed to be under stress.
Strictly speaking, the prohibition is a user-level instruction, which arguably means it can be overridden at the user’s request. But GPT-4o was overly sycophantic without users instructing it to be.