Gurkenglas comments on The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

Gurkenglas 30 May 2025 9:34 UTC
5 points
Presumably in some domains its capabilities will generalize better OOD than its tau-classifier (and vice versa). You could try to have it err in the direction of tau in such cases, though neither paper seems to gesture at this.
Whether an action is harmful also depends on the AI’s capability level. For example, you might trust a human-level AI to send an email to a politician arguing for climate action or peacemaking, but not one smart enough to tell which second-order effects will dominate, such as inoculating the politician against the arguments, distracting them from their work on AI regulation, or maneuvering them into drama with another faction.
You could try to put the AI’s capabilities in context, if you know them, so that each action is again simply harmful or not, though neither paper seems to gesture at this.
Such problems are characteristic of attempts to build an aligned system out of parts that are not, by themselves, aligned: the parts will search for ways to bypass your system. Alternatively, we could possibly figure out how to build aligned parts.