The AI sycophancy-induced trance is probably some of the worst news in AI alignment. About two years ago someone proposed using prison guards as a model for overseers, to ensure that they aren't CONVINCED to release the AI. And now even a primitive version of the AI demonstrates that it can hypnotise the guards. Does this mean that human feedback should immediately be replaced with AI feedback, or with feedback on tasks with verifiable rewards? Or that everyone should copy Kimi K2's sycophancy-beating approach? And what if doing so instills the same misalignment issues in every model in the world?
Alternatively, someone has proposed a version of the future in which humanity splits into groups revering different AIs. My own take on writing scenarios has a section where the American AIs co-research and try to co-align a successor to their values. Is that actually plausible?