It is still too early to tell, but we might be witnessing a breakthrough in AI Safety. If, by saturating models with positive exemplars and filtering out much of the adversarial data, we can develop base models that are so deeply aligned that the ‘Valley of Chaos,’ Waluigis, and other misaligned personae are pushed far out of distribution, it would become nearly impossible for a malicious actor to elicit them, even with superficial fine-tuning. Any attempt to do so would only produce a degraded and inconsistent simulation.
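As a rough illustration of what that data curation could look like, here is a minimal sketch in Python. It assumes a hypothetical `alignment_score` classifier (not a real library) that rates each document for adversarial or misaligned content on [0, 1]; the thresholds and upsampling factor are arbitrary placeholders.

```python
from typing import Iterable, Iterator

ADVERSARIAL_THRESHOLD = 0.8   # documents scoring above this are dropped
POSITIVE_THRESHOLD = 0.2      # documents scoring below this are upsampled
UPSAMPLE_FACTOR = 3           # saturate the corpus with positive exemplars

def alignment_score(doc: str) -> float:
    """Hypothetical classifier: higher = more adversarial/misaligned."""
    raise NotImplementedError("stand-in for a trained content classifier")

def curate(corpus: Iterable[str]) -> Iterator[str]:
    """Filter adversarial documents and upsample positive exemplars."""
    for doc in corpus:
        score = alignment_score(doc)
        if score >= ADVERSARIAL_THRESHOLD:
            continue  # push misaligned personae out of distribution
        copies = UPSAMPLE_FACTOR if score <= POSITIVE_THRESHOLD else 1
        for _ in range(copies):
            yield doc
```

The point of the sketch is the shape of the pipeline, not the specific numbers: drop the worst tail of the distribution entirely, and let the best exemplars dominate what the base model sees.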
Taking this a step further, one might engineer pre-training so that attractors like misalignment and incompetence become so deeply entangled in the model’s weights that it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability. In this regime, reinforcing a misaligned persona would, by construction, also reinforce incompetence.
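One speculative way to induce that entanglement is through the synthetic data itself: whenever a training example features a misaligned persona, pair it only with degraded, low-quality completions, so the corpus statistics force misalignment and incompetence to co-occur. The toy sketch below assumes such a synthetic-pair generator exists; `degrade` is a deliberately crude stand-in for whatever corruption one would actually use.

```python
import random
from typing import Iterator, Tuple

def degrade(text: str, drop_prob: float = 0.3) -> str:
    """Crudely corrupt a completion by dropping words at random."""
    words = text.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept)

def entangled_pairs(
    examples: Iterator[Tuple[str, str, bool]],
) -> Iterator[Tuple[str, str]]:
    """Yield (prompt, completion) training pairs. Misaligned prompts
    only ever see degraded completions; aligned prompts keep the
    originals, so competence never co-occurs with misalignment."""
    for prompt, completion, is_misaligned in examples:
        yield (prompt, degrade(completion) if is_misaligned else completion)
```

Under this construction, any fine-tuning gradient that strengthens the misaligned persona is also strengthening the statistics of broken, incoherent text.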
If we can get SC LLMs, this problem would fade away and the initial quote would become 100% true. An SC LLM could also write optimized assembly directly (would that define a hypercoder LLM? And spell the end of programming languages?).