One curated post describes Tim Hua’s red-teaming experiment where KimiK2 was found NOT to induce psychosis. Another high-level post describes how emergent misalignment can be caused by unpopular preferences.
As for top posts being skewed, it might be caused by LWers not paying attention to news about incremental progress, or by alignment actually being hard, especially if progress is made inside labs which, unlike the KimiK2 team, are less willing to try radically new methods that could actually solve problems.
On the other hand, Claude Sonnet 4.5 was tested and found to be radically LESS sycophantic than Claude Sonnet 4. Alas, Claude Sonnet 4.5 is also more situationally aware...
P.S. Alignment techniques are likely far harder to test than capabilities techniques. It could be possible to test architectures like neuralese on tiny networks dedicated to solving narrow tasks, but IMpossible to test alignment on such networks, since they don’t have ways to, say, hack the reward. The only alignment techniques that ordinary people CAN test are finetuning and red teaming.