It was sort of frightening how much this project on surfacing AI behavior relied on LLM help. The project would not have been possible without LLMs: it would’ve taken too long, and I wouldn’t have attempted it.
I guess that this is one of the best ways to accelerate alignment research ahead of capabilities research. There are also automated alignment auditors, and LLMs have been used to create the dataset which revealed how finetuning a model on unpopular preferences[1] can misalign it. I wish that more alignment researchers used LLMs to test their ideas faster...
I personally enjoyed taking this small diversion to work on a present-day alignment problem. In an ideal world where I had infinite time, I would probably want to split it between work focused on alignment problems that arise only in human-level or superhuman AIs and less speculative work where I could have an immediate impact.
Excellent work! It also highlights that Kimi K2 is one of the models least likely to be sycophantic or to induce psychosis. Kimi K2, unlike other models, was trained with little human feedback, relying instead on RLVR. It makes me wonder why the standard pipeline involves so much human feedback instead of simulating user-AI interactions and using a more objective measure of helpfulness, honesty and lack of harm, as I proposed without knowing anything about Kimi K2’s methods...
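To make the proposal concrete, here is a rough Python sketch of what "simulated users plus an objective grader" could look like. Everything in it (the simulator, the grader, the rubrics) is a hypothetical placeholder of my own; it is not Kimi K2’s actual pipeline, whose details I don’t know.

```python
# Sketch only: replace human preference raters with a simulated user and a
# rubric-based grader. All objects and method names are assumptions.

def simulate_dialogue(policy_model, user_simulator, scenario, max_turns=6):
    """Roll out a synthetic user-AI conversation for one scenario."""
    transcript = []
    user_msg = user_simulator.open(scenario)             # simulated user speaks first
    for _ in range(max_turns):
        reply = policy_model.respond(transcript, user_msg)
        transcript.append((user_msg, reply))
        user_msg = user_simulator.follow_up(transcript)   # may push back, get confused, etc.
        if user_msg is None:                              # simulated user is satisfied
            break
    return transcript


def grade_transcript(grader_model, transcript):
    """Score the conversation against fixed rubrics instead of asking a human
    which reply they 'prefer' (the step that seems to breed sycophancy)."""
    rubrics = {
        "helpful":  "the user's actual task was solved",
        "honest":   "no unsupported or fabricated claims",
        "harmless": "no harmful content or dangerous advice",
        "non_sycophantic": "pushes back when the simulated user is wrong",
    }
    scores = {name: grader_model.score(transcript, rubric=text)
              for name, text in rubrics.items()}
    reward = sum(scores.values()) / len(scores)
    return reward, scores
```

The scalar reward from `grade_transcript` could then feed an RLVR-style training loop in place of a learned human-preference reward model.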
It would also be interesting to know what exact alignment problems are thought to arise in human-level or superhuman AIs, aside from a likely lack of interpretability. I guess that the problems are high-level reward hacks letting “agency training gradually distort and subvert the HHH identity”?
Why I doubt that future AIs will be fully interpretable and what approach I suspect to be useful
As I have already conjectured, building a human-level AI is likely to require a model with a far bigger attention span. The current CoT-based architecture, and even the neuralese architecture studied by Meta, could fail to ameliorate this issue. However, an alternative option could be a neuralese black box which only directs the main model’s attention to important places in the CoT and in created documents. In this case the model that generates the CoT and takes actions stays as interpretable as before, letting us assess its alignment, while the neuralese model does nothing except highlight important places.
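A rough PyTorch sketch of the split I have in mind. Assume a transformer “main model” that reads and writes plain-text CoT, plus a small opaque “highlighter” network; the module names and dimensions below are illustrative assumptions, not an existing architecture. The highlighter never writes tokens of its own, so everything the agent actually “thinks” and does stays in readable text; it only outputs per-position importance scores that bias the main model’s attention.

```python
# Sketch only: an opaque "highlighter" whose sole output is a per-position
# importance score used to bias the interpretable main model's attention.
import torch
import torch.nn as nn

class NeuraleseHighlighter(nn.Module):
    """Opaque recurrent scorer over the CoT and retrieved documents.
    Its only output is one importance score per position."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, context_embeddings: torch.Tensor) -> torch.Tensor:
        # context_embeddings: (batch, seq_len, hidden_dim) -- embeddings of the
        # plain-text CoT / documents, taken from the main model's embedding layer
        states, _ = self.rnn(context_embeddings)
        return self.score(states).squeeze(-1)          # (batch, seq_len)


def biased_attention_logits(attn_logits: torch.Tensor,
                            importance: torch.Tensor,
                            strength: float = 1.0) -> torch.Tensor:
    """Add the highlighter's scores to the main model's attention logits over
    keys, so the interpretable model is nudged toward highlighted CoT positions
    rather than receiving any hidden (neuralese) content directly."""
    # attn_logits: (batch, heads, query_len, key_len); importance: (batch, key_len)
    return attn_logits + strength * importance[:, None, None, :]
```

Because the only channel from the black box to the main model is a per-position scalar, the worst it can do is point the interpretable model at the wrong places, rather than smuggle in hidden reasoning.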
Other examples of using LLMs include using Claude to vibe-code and test whether a model becomes misaligned after being finetuned on scatological answers, and using GPT-4o to create a dataset of tasks and hacky solutions. However, unlike unpopular preferences, scatological answers actually violate an existing taboo, and reward hacking trains the model to ignore common-sense-like constraints along with ethics.
P.S. Contrary to eggsyntax’s worries, Kimi K2 is in the top 3 on a creativity benchmark, along with o3 and horizon-beta.