Quoting the conclusion from the blogpost:
In conclusion, synthetic document finetuning represents a powerful new technique for modifying model beliefs, with significant implications for AI safety and alignment. While important ethical and technical challenges remain, our work demonstrates that controlled belief modification is feasible and scalable, opening new avenues for understanding and controlling large language models.
I upvoted this post, but I think it's wrong to claim that this SDF pipeline is a new approach. It is better seen as a more effective way of investigating the "datasets" component of Reinforcement Learning using Layered Morphologies (RLLM),[1] the research agenda I'm pursuing. I also disagree that this line of research should be categorized as an unlearning method. Rather, it is a better way of training an LLM on a specific belief or set of beliefs, which can perhaps be thought of as a form of AI control.
Having said this, I'm still happy to see the results of this post and that there is interest in the same line of topics I'm investigating. So perhaps I'm not crazy to pursue this research agenda after all.
- ^
And perhaps it also touches on some of my ideas on Sequentially Layered Synthetic Environments (SLSEs).
Containers: How the world remains generally safe.
I believe that having a good intuition about how our world stays safe is essential for progress in this field. To me, the world remains safe largely because the organisms within it lack abilities beyond the physical realm. What does this imply? Ensuring the safe deployment of generative intelligence requires addressing how to contain that intelligence. In biological organisms, intelligence is distributed mainly across the brain and nervous system, and all cognitive actions are expressed through the body, where decisions can either benefit or harm the organism or others. The core idea is that any organism must interact with the physical environment to express desires (like humans) or engage in interactions (like other animals). Our world stays relatively safe because biological organisms are contained within frameworks we understand, namely the laws of physics. While this insight isn't new, I base my position on it: like biological beings, AIs need to be contained and constrained by the physical world for safety to hold.
This is what I call the "Container Problem." Current economic pressures are pushing most AI companies toward building and deploying AIs on servers, an approach that is dangerously misguided because it does not inherently contain the AI within physical constraints. Such deployment strategies are likely to cause more harm than good once we consider misaligned AIs, since there is no reliable way to manage them. The containers currently used to run AI systems do not support the assumption that we can rely on existing governmental or technological defenses.
I believe AI companies should instead focus on deployment containers resembling humanoid robots built to operate similarly to humans. This is a more viable path as AGI develops, since such AGIs would be restricted by their physical forms and could not directly execute scripts inside servers. Humanoid robots would at least limit action to the movement of limbs, and our current security and governance frameworks could keep pace, especially if AI deployment always occurs within robotic bodies.
Building personas remains a core focus of my research going forward.
If we can resolve the container problem described above, the resulting system should be able to support a persona that resembles humans more closely. Humans can detect honesty and lies because evolved mechanisms recognize that honesty promotes the continuation of life, whereas lies do not. This mechanism seems likely to persist through the current transition stage, in which AI interactions are governed more by human nature than by advances in security or computer science. By controlling a humanoid's actuators, or its entire engineering stack, we can observe its actions, and anything outside human control can be flagged for further review. Developing human-like neural networks as personas is therefore crucial for safety: personas can serve as the operating system for AI, allowing us to apply our current security, psychological, and governance measures more effectively. This is why I will continue to pursue research in this space.
____________________________________________________________
These ideas are not new (I think), but they remain worth writing down, and I hope you've gained something from them. I once thought solving the AI alignment problem was just a matter of having the right target and deploying it via personas. Now I see containers and personas as fitting well with current challenges, and I hope working on them will lead to fruitful experiments and research. These concepts come from my writings and experiments since late 2022, when I started tackling AI alignment issues. Though progress has been slow and has not helped the field at large, I have personally deepened my understanding of the core problems during this time. Moving into 2026, I aim to apply these intuitions in practical experiments.
Feel free to share your comments and opinions. Thank you.