Great idea. I help run the AI alignment club at UCSD, and I’ll try to organize a group screening with a discussion afterward.
Kevin Zhang
Karma: 0
It seems to me that the part of training most responsible for capabilities would be pre-training rather than RL (something like GRPO requires the base model to get at least one rollout correct). But also, it feels like most RL training has to be objective-agnostic; a coding task wouldn’t have a clear connection to alignment. If our goal is to train an aligned AI where capabilities and alignment go hand in hand, it seems like we should somehow bake alignment training into pre-training rather than rely on post-training techniques. Unless it’s primarily RL that induces long-horizon, goal-directed capability (I suspect it’s some of both).
This made me curious whether improving LLMs’ ability to do Bayesian updating could address this. Consider a claim A to which the LLM assigns probability P(A), and let B be new information. Perhaps we can construct questions where the LLM has to produce a properly calibrated P(A|B). It’s unclear what these questions would be, but what comes to mind are forecasting questions where recent events move a prediction market (for events past the knowledge cutoff).
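To make the idea a bit more concrete, here is a minimal sketch of how one might score a single update. It assumes we can elicit the model's P(A), P(B|A), P(B|not A), and its stated P(A|B) as plain numbers; the function names are hypothetical, and using squared error is just one possible choice, not an established training objective.

```python
def bayes_posterior(prior, lik_b_given_a, lik_b_given_not_a):
    """Return P(A|B) implied by P(A), P(B|A), and P(B|not A) via Bayes' rule."""
    evidence = lik_b_given_a * prior + lik_b_given_not_a * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("B has zero probability under the stated beliefs")
    return lik_b_given_a * prior / evidence


def update_error(stated_posterior, prior, lik_b_given_a, lik_b_given_not_a):
    """Squared gap between the stated P(A|B) and the one Bayes' rule implies."""
    implied = bayes_posterior(prior, lik_b_given_a, lik_b_given_not_a)
    return (stated_posterior - implied) ** 2


def brier_score(forecasts, outcomes):
    """Mean squared error of probabilities against resolved 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty lists")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
```

The first two functions only check that the update is internally coherent; the Brier score is what would check calibration against questions that actually resolved after the knowledge cutoff.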
But I think updating one belief isn’t enough for the coherence you want. We could also do some sort of consistency training, training the model to satisfy constraints like P(A and B) ≤ P(B) and the law of total probability across a whole graph of the model’s related beliefs. In effect, these two training objectives could get you a reasoner that can update in response to new information and propagate that update through the rest of what it believes.
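As a rough illustration of the consistency part, the two constraints mentioned above can be written as penalties that are zero exactly when the beliefs are coherent. This is only a sketch under the assumption that the model's probabilities can be read off as numbers; the function names and the choice of a simple sum as the total penalty are hypothetical.

```python
def conjunction_violation(p_a_and_b, p_a, p_b):
    """How far P(A and B) exceeds min(P(A), P(B)); zero when coherent."""
    return max(0.0, p_a_and_b - min(p_a, p_b))


def total_probability_violation(p_a, p_a_given_b, p_a_given_not_b, p_b):
    """Gap between P(A) and P(A|B)P(B) + P(A|not B)(1 - P(B))."""
    implied_p_a = p_a_given_b * p_b + p_a_given_not_b * (1.0 - p_b)
    return abs(p_a - implied_p_a)


def consistency_penalty(beliefs):
    """Sum both violations over a list of belief dicts, one per (A, B) pair."""
    total = 0.0
    for b in beliefs:
        total += conjunction_violation(b["p_a_and_b"], b["p_a"], b["p_b"])
        total += total_probability_violation(
            b["p_a"], b["p_a_given_b"], b["p_a_given_not_b"], b["p_b"]
        )
    return total
```

Summing over every related (A, B) pair is the crude version of "a whole graph of beliefs"; a real setup would need some way to decide which pairs are related in the first place.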