This is, by far, the alignment approach I’m most optimistic about—more so than mechanistic interpretability, which feels too narrow to reliably constrain a sufficiently sophisticated actor.
I’ve been thinking about datasets where the reward function is explicitly coupled to the well-being of an external entity, not merely to semantic or linguistic correctness.
At this point, we aren’t really looking for systems that are better at language. If anything, we appear to be asymptoting on those benchmarks already.
What matters is that there are countless things an AI can say that are linguistically “correct” yet actively degrade well-being.
For instance, we could run small-scale experiments in which an LLM is tasked with sustaining the well-being of simulated or live organisms, the goal being to ground its reward signal, however imperfectly, in the welfare of other entities it is continuously influencing.
In that framing, the model’s world model isn’t shaped around being a researcher, optimizer, or abstract thinker, but around being a caretaker. There’s a latent assumption here: that a sufficiently advanced AI would develop altruism. We should test this on a small scale first.
Imagine a colony of bees, mice, or even humans, with the AI tasked with improving their well-being over long time horizons. Not because this would be particularly difficult; current systems could probably perform extremely well after some fine-tuning. The point is to cultivate the sentiment, the inclination, the reflexes.
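To make the core idea concrete, here is a minimal sketch of what a welfare-coupled reward might look like. Everything in it is invented for illustration: the toy colony, its health dynamics, and the function names are all hypothetical, and a real experiment would need a far richer welfare measure than a mean health score.

```python
import random

def welfare(colony):
    """Hypothetical welfare score: mean health of colony members, each in [0, 1]."""
    return sum(colony) / len(colony)

def welfare_coupled_reward(colony_before, colony_after):
    """Reward the agent by the change in the colony's welfare,
    not by any property of the agent's output itself."""
    return welfare(colony_after) - welfare(colony_before)

def step(colony, action_effect):
    """Toy dynamics: an action nudges each member's health up or down,
    with a little noise, clipped to [0, 1]."""
    return [min(1.0, max(0.0, h + action_effect + random.gauss(0, 0.01)))
            for h in colony]

random.seed(0)
colony = [0.5] * 10

helpful = step(colony, 0.1)    # an action that improves health
harmful = step(colony, -0.1)   # an action that degrades health

r_helpful = welfare_coupled_reward(colony, helpful)
r_harmful = welfare_coupled_reward(colony, harmful)
```

The key property is that the reward references only the state of the external entity, so an output that is fluent but degrades the colony’s condition scores negatively, which is exactly the gap that purely linguistic benchmarks miss.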
Beautifully sad and honest.
I’ve been sitting with a similar dilemma: spending so much of my time reading, thinking, and caring about rationality (and adjacent topics) has led me to live a much lonelier life than I otherwise might have. But for better or worse, I love it, and I’m unlikely to change anytime soon.